```{r, include = FALSE}
source("global_stuff.R")
```
# ANOVA
## Readings
Chapters 7 and 8 from [@abdiExperimentalDesignAnalysis2009]. You can also check out my chapter on ANOVA [@crumpAnsweringQuestionsData2018] here [https://crumplab.github.io/statistics/anova.html](https://crumplab.github.io/statistics/anova.html).
<iframe width="560" height="315" src="https://www.youtube.com/embed/HPPH3Jgn0nk" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
## Overview
The next three weeks cover one-factor Analysis of Variance (ANOVA). We can think of ANOVA as an extension of the t-test that can be applied to designs where the independent variable (IV) has more than two levels, or to even more complicated designs involving multiple IVs (possibly with many levels each).
We begin with the one-factor ANOVA for designs with a single independent variable that has two or more levels. As we will see shortly, if we use ANOVA for a design with two levels, the result will be the same as a t-test. However, unlike a t-test, we can also use the ANOVA for designs with more than two levels.
This lab has concept sections to help firm up ANOVA concepts, and practical sections on using the `aov()` function to compute ANOVAs in R.
Let's consider a few pictures first, and then move on to the lab:
```{r, echo=FALSE}
knitr::include_graphics('imgs/ANOVA/ANOVA.004.jpeg')
```
### Grand mean
```{r, echo=FALSE}
knitr::include_graphics('imgs/ANOVA/ANOVA.005.jpeg')
```
### Within group variation
```{r, echo=FALSE}
knitr::include_graphics('imgs/ANOVA/ANOVA.006.jpeg')
```
### Between group variation
```{r, echo=FALSE}
knitr::include_graphics('imgs/ANOVA/ANOVA.007.jpeg')
```
### F Ratio
```{r, echo=FALSE}
knitr::include_graphics('imgs/ANOVA/ANOVA.008.jpeg')
```
<!--
## Concept 1: Manipulations and using means to explain variance
We have been discussing the concept of explaining variance in lab and lecture for several weeks. For example, when we correlate one variable `X` with another variable `Y`, we can compute the coefficient of determination, $R^2$, which quantifies how much of the variation in `Y` can be explained by `X`. More specifically, in the simple linear regression case, we are effectively quantifying how closely the data points fall to a best-fit line. If all of the data points in the scatterplot fall on the best-fit line, then we say the line explains all of the variation. If there is residual error, then we say the line explains some proportion of the variance, and there is a leftover part (the residuals) that remains unexplained. With ANOVA, we will be doing something similar; except, instead of trying to explain variation in a dependent variable with a line, we will be explaining variation in terms of means.
What follows is a series of ideas and R code that could be helpful for understanding general statistical principles underlying the ANOVA logic. If we understand these principles, we should be confident when we use ANOVA because we will understand what it is capable (and not capable) of doing, and we will know how to extend the principles to develop other ANOVA-like tools if we thought that would be useful for our research purposes.
The big question we will ask is: what ought to happen to your dependent variable if your experimental manipulation is effective and truly does cause change in what you are measuring?
### Variation in your dependent variable
First, let's consider variability in our measurements outside of an experiment. As we know, many measurements in psychology are inherently variable. As a result, every time we measure something, `Y`, we know we will collect different values.
If we assume that our measurement is coming from some stable population distribution, then what do we assume will happen when we take measurements, that is, when we take sample data from the population?
We expect the sample data will have some variability too, but that the sample data will generally be distributed like the population distribution it came from.
For example, let's sample 10 numbers from a normal distribution with (mean = 0, sd =1) and do this twenty times.
The bars show the sample means, and the dots show the individual data points in each sample.
```{r}
library(tibble)
library(ggplot2)
sample_data <- tibble(sample = rep(1:20, each=10),
DV = rnorm(20*10,0,1))
ggplot(sample_data, aes(x=sample,y=DV))+
geom_bar(stat="summary",fun="mean", position="dodge") +
geom_point(size=1)+
geom_hline(yintercept = 0)+
theme_classic()
```
The graph represents taking different measurements from the same population multiple times over. Every time we get a slightly different sample, but as a general rule, the sample mean is close to the population mean (0), and the variability is pretty consistent across samples.
### Manipulation and Variation
Let's now consider what we expect to happen to the variation in our measurement in the context of an experiment. The primary working feature of an experiment is the manipulation, or the independent variable. If an experiment is correctly designed, it should be capable of **causing the measured variable to change in some way**. Our general question now is to consider what would happen to the variation in our measurement when the experimental manipulation works.
#### What ought to happen when the manipulation does not work?
Let's slightly change the graph from above. Let's imagine that the first 10 samples are measures we took from 10 subjects in group 1 of an experiment, and samples 11-20 are measures we took from 10 subjects in group 2 of an experiment. Let's further stipulate that the manipulation does not work, and is incapable of causing the measurement to change.
What would we expect the data to look like? Remember of course that the measurement will still have some variability, so we would expect something like the following to happen:
```{r}
sample_data <- tibble(group = rep(c("1","2"),each=10*10),
subjects = rep(1:20, each=10),
DV = rnorm(20*10,0,1))
ggplot(sample_data, aes(x=subjects,y=DV))+
geom_bar(stat="summary",fun="mean", position="dodge",
aes(color=group,
fill=group)) +
geom_point(size=1)+
geom_hline(yintercept = 0)+
theme_classic()
```
Our question was: what would happen to the variation in our measurement when the experimental manipulation does not work. The answer is that nothing should happen to the variation in our measurement when the manipulation does nothing. Instead, we should expect to measure whatever variation is normally present in our dependent variable. In other words, there really isn't a difference between the first graph we made (20 samples from the same distribution), and this graph. Both involve randomly sampling 10 values from the same distribution, and repeating that process 20 times.
#### What ought to happen when the manipulation DOES work?
Let's get back to the big issue. If an experiment is correctly designed, it should be capable of **causing the measured variable to change in some way**. So, what would happen to the variation in our measurement when the experimental manipulation actually works?
There are three possible answers. The variation in our measure could increase, decrease, or stay the same. So which is it? The answer to this question in part depends on the nature of the causal influence from the manipulation. Let's be a little bit more specific and consider the following question. If the manipulation causes an increase or decrease in the mean of the dependent variable, what will happen to the variability in our measurement?
Before we simulate this in R, let's first establish how much variability there is in the first place. For example, how much variability is there in the dependent variable from our previous example?
We could simply take all of the data points, across all subjects in both groups and compute the variance.
```{r}
var(sample_data$DV)
```
Now, let's repeat the above experiment, but this time we will simulate the idea that the IV does actually change the measurement. For example, let's say the manipulation in group 2 causes scores to rise dramatically, by three whole standard deviations, relative to group 1.
I accomplish this by changing the mean parameter from 0 to 3 in the `rnorm()` function for the second set of subjects. Notice that I did not change the standard deviation of the population distribution, because I am assuming the manipulation only causes a mean shift.
```{r}
sample_data <- tibble(group = rep(c("1","2"),each=10*10),
subjects = rep(1:20, each=10),
DV = c(rnorm(10*10,0,1),
rnorm(10*10,3,1))
)
ggplot(sample_data, aes(x=subjects,y=DV))+
geom_bar(stat="summary",fun="mean", position="dodge",
aes(color=group,
fill=group)) +
geom_point(size=1)+
geom_hline(yintercept = 0)+
theme_classic()
```
So, what has happened to the overall variability in the measurement? If we ignore the grouping variable, notice that now we have dots in more places than we did before. In previous graphs the dots ranged roughly between -2 and +2. Now, taken together they range between -2 and +4. So, **the manipulation caused the variability in the measurement to increase**.
We can confirm that the total amount of variation has increased by computing the variance of all the scores. The variance has increased almost threefold from before.
```{r}
var(sample_data$DV)
```
I will leave it to you to determine what happens to the total variation if the manipulation causes a mean shift in the negative direction. For example, in the example above, group 2 was programmed to have larger, more positive values than group 1. We could instead have given group 2 smaller, more negative values than group 1. In general, no matter the direction of the change, if the IV causes differences in the means between the levels, then the presence of the IV should increase the total amount of variation relative to control conditions where the IV is not manipulated (but a similar number of groups are measured).
### Sources of Variation
We have demonstrated two sources of variation above. First, we have the pre-existing variation in our dependent variable. Second, we have additional variation caused by the experimental manipulation. In general, we will talk about the pre-existing variation as "unexplained" variation, and we will talk about the variation caused by the manipulation as "explained" variation. It is explained in the sense that we can point to the experimental manipulation as the source of the added variation.
The point of the preceding examples was to focus on the very general idea that total variation in a measurement scenario should change when the measurement conditions involve a causally effective experimental manipulation, compared to a control or causally ineffective manipulation. This is a general reason why researchers are interested in "analyzing variances": changes in variation can be a signature of the causal influences that we are attempting to detect with our experiments.
### Using means to explain variation
Consider the process of using means to explain variance in observed data. Remember that variances are averages of the squared deviations between data points and their mean.
#### The grand mean
Let's use the experimental data from the last example, and try to explain the variance from the perspective of the grand mean. Here is the data again (resampled because I'm running the code again).
```{r}
sample_data <- tibble(group = rep(c("1","2"),each=10*10),
subjects = rep(1:20, each=10),
DV = c(rnorm(10*10,0,1),
rnorm(10*10,3,1))
)
ggplot(sample_data, aes(x=subjects,y=DV))+
geom_bar(stat="summary",fun="mean", position="dodge",
aes(color=group,
fill=group)) +
geom_point(size=1)+
geom_hline(yintercept = 0)+
theme_classic()
```
What is the grand mean? The grand mean is the mean across all of the data ignoring the groups.
```{r}
mean(sample_data$DV)
```
This is the same mean that we use to compute the total variance:
```{r}
var(sample_data$DV)
```
We could add the grand mean as a blue line to the graph:
```{r}
ggplot(sample_data, aes(x=subjects,y=DV))+
geom_bar(stat="summary",fun="mean", position="dodge",
aes(color=group,
fill=group)) +
geom_point(size=1)+
geom_hline(yintercept = 0)+
geom_hline(yintercept = mean(sample_data$DV), color="blue")+
theme_classic()
```
How well does the blue line explain the location of all of the dots? Well, it is the mean, so it is in the middle of all the dots. If you drew lines between each dot and the blue line, squared them, added them up (Sums of Squares, SS), and then averaged them, you would have the variance. We computed the variance to be about 3. The question we ask in ANOVA is whether our experimental manipulation can help "explain" some of this total variance.
Under the null hypothesis, all of the variance is explained by the grand mean, which represents the mean of the single distribution where the data come from. In other words, the manipulation doesn't do anything, so whatever variance we have is from the "pre-existing" variance.
When the experimental manipulation works, then we assume that some of the total variance we are measuring is actually due to the manipulation. In other words, without the manipulation the variance would have been smaller; as we have shown, the manipulation ought to increase the total variance. The next step is to **partition** the variance into different sources. Specifically, how much of the total variance was "pre-existing", and how much was caused by the manipulation?
Before we proceed, I'm going to make a small change to the presentation of the data. The previous graphs show individual subject means (bars) and individual subject raw data (dots) for both groups. Let's look at a graph of the group means (two bars, one for each group), where the dots represent individual subject means.
```{r}
library(dplyr)
mean_data <- sample_data %>%
group_by(group,subjects) %>%
summarize(mean_dv = mean(DV))
ggplot(mean_data, aes(x=group,y=mean_dv))+
geom_bar(stat="summary",fun="mean", position="dodge",
aes(color=group,
fill=group)) +
geom_point(size=1)+
geom_hline(yintercept = 0)+
geom_hline(yintercept = mean(mean_data$mean_dv), color="blue")+
theme_classic()
```
The grand mean is the same as before:
```{r}
mean(mean_data$mean_dv)
```
And, the total variance is:
```{r}
var(mean_data$mean_dv)
```
Note, here we are computing the variance in the subject means with respect to the grand mean. Before, we computed the variance in all of the raw data with respect to the grand mean.
Remember, according to the null hypothesis, the grand mean of the sample data is the best estimate of the population mean, and it is the only basis for explaining the variation in the data. Consider the following table.
The `mean_dv` column represents the mean score for each subject. If we pretend that each score is just the sum of two components, the grand mean plus a deviation, then from this viewpoint all of the scores are "explained" by the grand mean, and the variance associated with this explanation is the average of the squared errors (the squared differences between the original scores and the grand mean).
```{r}
mean_data <- mean_data%>%
mutate(grand_mean = mean(mean_data$mean_dv)) %>%
mutate(squared_error = (mean_dv-grand_mean)^2)
knitr::kable(mean_data)
```
The variance is:
```{r}
var(mean_data$mean_dv)
```
And, the sum of the squared errors divided by the degrees of freedom (20 - 1 = 19) is the same thing.
```{r}
sum(mean_data$squared_error)/(20-1)
```
#### The means from each level of the IV
Let's consider a different perspective. What if we assume that the IV did cause a difference, such that the means of the groups differ because of the manipulation?
One way to approach this issue is to think of each score as being replaced by its group mean, and then to calculate how much variance is associated with this characterization.
```{r}
group_means <- mean_data %>%
group_by(group) %>%
summarize(group_mean = mean(mean_dv))
IV_mean_data <- mean_data %>%
ungroup() %>%
mutate(IV_means = rep(group_means$group_mean, each = 10)) %>%
mutate(IV_squared_error = (IV_means-grand_mean)^2)
knitr::kable(IV_mean_data)
```
If we treat each score as if the only thing we knew about it was the group mean it came from, then we could measure how much variance is associated with this perspective. For example:
```{r}
var(IV_mean_data$IV_means)
```
This is the same as the sum of the squared errors between the group means and the grand mean, divided by the degrees of freedom (19).
```{r}
sum(IV_mean_data$IV_squared_error)/19
```
-->
## Practical 1: By hand ANOVA using R
```{r}
library(tibble)
romeo_juliet <- tibble(subjects = 1:20,
Group = rep(c("No Context",
"Context Before",
"Context After",
"Partial Context"), each = 5),
Comprehension = c(3,3,2,4,3,
5,9,8,4,9,
2,4,5,4,1,
5,4,3,5,4
)
)
romeo_juliet$Group <- factor(romeo_juliet$Group,
levels = c("No Context",
"Context Before",
"Context After",
"Partial Context"))
knitr::kable(romeo_juliet)
```
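Before working through the sums of squares, it can help to glance at the means we will be using to explain variance. A quick look with `dplyr`:
```{r}
library(dplyr)
# group means for each level of the IV, and the grand mean
romeo_juliet %>%
  group_by(Group) %>%
  summarize(mean_Comprehension = mean(Comprehension), .groups = 'drop')
mean(romeo_juliet$Comprehension)
```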
### Grand Mean and total sums of squares
```{r}
library(dplyr)
# get grand mean
grand_mean <- mean(romeo_juliet$Comprehension)
# get squared deviations from grand mean
SS_total_table <- romeo_juliet %>%
mutate(grand_mean = mean(romeo_juliet$Comprehension)) %>%
mutate(deviations = Comprehension - grand_mean,
sq_deviations = (Comprehension - grand_mean)^2)
#sum them
SS_total <- sum(SS_total_table$sq_deviations)
```
### SS Between
```{r}
# get group means
group_means <- romeo_juliet %>%
group_by(Group) %>%
summarize(mean_Comprehension = mean(Comprehension),.groups = 'drop')
# get squared deviations between grand mean and group means
SS_between_table <- romeo_juliet %>%
mutate(grand_mean = mean(romeo_juliet$Comprehension),
group_means = rep(group_means$mean_Comprehension, each = 5)) %>%
mutate(deviations = group_means - grand_mean,
sq_deviations = (group_means - grand_mean)^2)
SS_between <- sum(SS_between_table$sq_deviations)
```
### SS Within
```{r}
# get group means
group_means <- romeo_juliet %>%
group_by(Group) %>%
summarize(mean_Comprehension = mean(Comprehension), .groups = 'drop')
# get squared deviations between group means and original data points
SS_within_table <- romeo_juliet %>%
mutate(group_means = rep(group_means$mean_Comprehension, each = 5)) %>%
mutate(deviations = group_means - Comprehension,
sq_deviations = (group_means - Comprehension)^2)
SS_within <- sum(SS_within_table$sq_deviations)
```
Check that SS total = SS Between + SS Within:
```{r}
SS_total
SS_between + SS_within
# == can fail for floating-point numbers; all.equal() compares within a tolerance
isTRUE(all.equal(SS_total, SS_between + SS_within))
```
### F is a ratio of variances
```{r}
# Between groups variance
dfb <- 4-1
MS_Between <- SS_between/dfb
# Within groups variance
dfw <- 20-4
MS_Within <- SS_within/dfw
# compute F, a ratio of variances
F_ratio <- MS_Between/MS_Within
```
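The F ratio on its own is just a descriptive statistic. To attach a p-value to it, we can compare it against the F-distribution with the same degrees of freedom, using the `pf()` function (the same function that appears in Concept 2 below):
```{r}
# probability of obtaining an F at least this large under the null
pf(F_ratio, dfb, dfw, lower.tail = FALSE)
```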
### Alternative example using matrix
```{r}
# represent data in a wide-format matrix
matrix_data <- matrix(c(3,3,2,4,3,
5,9,8,4,9,
2,4,5,4,1,
5,4,3,5,4),
ncol=4,
nrow=5)
colnames(matrix_data) <- c("No Context",
"Context Before",
"Context After",
"Partial Context")
## Sums of Squares
SS_total <- sum( (matrix_data - mean(matrix_data))^2 )
SS_between <- sum( (colMeans(matrix_data) - mean(matrix_data))^2 )*5
SS_within <- sum( (colMeans(matrix_data) - t(matrix_data))^2 )
## Mean squared errors
dfb <- 4-1
MS_Between <- SS_between/dfb
dfw <- 20-4
MS_Within <- SS_within/dfw
# compute F, a ratio of variances
F_ratio <- MS_Between/MS_Within
```
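To see all of the pieces in one place, we can collect the by-hand values into a small ANOVA-style table (a quick sketch using the objects computed above):
```{r}
# assemble the by-hand values into an ANOVA-style summary table
data.frame(term = c("Between", "Within"),
           df = c(dfb, dfw),
           SS = c(SS_between, SS_within),
           MS = c(MS_Between, MS_Within),
           F = c(F_ratio, NA))
```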
### Using the built-in `aov()`
Note: we go into more detail about `aov()` in the practical section below.
```{r}
anova.out <- aov(Comprehension ~ Group, data = romeo_juliet)
summary(anova.out)
```
The degrees of freedom, sums of squares, mean squares, and F-value in the table should match the values we just computed by hand.
## Concept 1: Understanding the pieces of the ANOVA table
Most of the pieces of the ANOVA table (degrees of freedom, Sums of Squares, Mean squared error, and the F-value) are computed directly from the data. Algebraically, the Mean squared errors are the sums of squares divided by their respective degrees of freedom. And, the F-value is a ratio: the Mean squared error for the group (Between) divided by the Mean squared error for the residuals (Within). The p-value does not come from the data; it comes from an F-distribution. This concept section is followed by a section on creating a simulated F-distribution. To prepare us for that concept, we will first focus on the pieces of the ANOVA table, and examine how each piece ought to behave in the context of experimental designs. Specifically, we will ask how each term ought to behave if the Null hypothesis is true, as well as if it isn't.
In the textbook "Romeo & Juliet" example, the data had 4 groups and 5 observations per condition. The empirical question was whether the group manipulation caused changes in the means. The ANOVA was applied to that data set and a large F-value was found. This F-value was unlikely to be produced by the null hypothesis, so that hypothesis was rejected. We will continue with this basic example, but examine it from a slightly more abstract perspective. We will ask how each of the terms in the ANOVA for this design ought to behave if the Null is true, and if the Null is not true.
### SS Total
If the null is TRUE, then all of the individual subject means in all of the groups are assumed to be randomly sampled from the very same normal distribution. For example, this could be one random sample of 20 subject means for the above design. We will use a unit normal distribution throughout.
```{r}
sim_data <- matrix(rnorm(20,0,1),ncol=4,nrow=5)
```
$\text{SS}_\text{Total}$ is the sum of squared deviations between the grand mean (mean of all the data) and each data point. Assuming we are sampling from a unit normal distribution, how should $\text{SS}_\text{Total}$ behave if the Null hypothesis is true?
We can compute $\text{SS}_\text{Total}$ for the simulated data from above.
```{r}
SS_total <- sum( (mean(sim_data) - sim_data)^2 )
SS_total
```
But, that is just what $\text{SS}_\text{Total}$ happens to look like for that one random sample of 20 values from the normal distribution. Under the null, every time we sample 20 values we would expect slightly different values for $\text{SS}_\text{Total}$. Let's generate the sampling distribution of $\text{SS}_\text{Total}$. We would expect $\text{SS}_\text{Total}$ to take on a value from this distribution if the Null is TRUE.
```{r}
SS_total_distribution <- c()
for(i in 1:1000){
sim_data <- matrix(rnorm(20,0,1),ncol=4,nrow=5)
SS_total <- sum( (mean(sim_data) - sim_data)^2 )
SS_total_distribution[i] <- SS_total
}
hist(SS_total_distribution)
mean(SS_total_distribution)
```
Notice also that the average of the simulated $\text{SS}_\text{Total}$ values is close to 19. That is what we should expect: for 20 scores sampled from a unit normal distribution ($\sigma^2 = 1$), the expected value of $\text{SS}_\text{Total}$ is $(N-1)\sigma^2 = 19$.
What would happen to $\text{SS}_\text{Total}$ if the Null hypothesis is false? There are infinitely many ways the null hypothesis could be false, but let's simulate one of them to see what could happen to $\text{SS}_\text{Total}$.
Specifically, let's imagine that the manipulation works, and that it causes the first group's mean to shift up by two standard deviations compared to the other groups.
```{r}
SS_total_distribution_alt <- c()
for(i in 1:1000){
sim_data <- matrix(rnorm(20,0,1),ncol=4,nrow=5)
sim_data[,1] <- sim_data[,1]+2 #add effect for group 1
SS_total <- sum( (mean(sim_data) - sim_data)^2 )
SS_total_distribution_alt[i] <- SS_total
}
hist(SS_total_distribution_alt)
```
The distribution of $\text{SS}_\text{Total}$ includes a smaller range of values when the Null is TRUE, compared to the distribution of $\text{SS}_\text{Total}$ for this one alternative scenario. **The big concept here is that a manipulation that causes changes in the means should increase the overall variance in the data relative to the grand mean**. Thus, $\text{SS}_\text{Total}$ should tend to be larger when the null is false, compared to when it is true.
```{r}
library(ggplot2)
SS_total_data <- data.frame(SS_total = c(SS_total_distribution,
SS_total_distribution_alt),
type = rep(c("Null","Alternative"), each=1000))
ggplot(SS_total_data, aes(x=SS_total, group=type, fill=type))+
geom_histogram(position="dodge")
```
### SS Between
$\text{SS}_\text{Between}$ treats each score as if it was the group mean, and sums the squared deviations between the group mean and grand mean. According to the null, the group means should all be the same (on average), but they will have some variance because of random sampling. Thus, we can simulate a sampling distribution of $\text{SS}_\text{Between}$ that we would expect under the null.
```{r}
SS_between_distribution <- c()
for(i in 1:1000){
sim_data <- matrix(rnorm(20,0,1),ncol=4,nrow=5)
SS_between <- sum( (mean(sim_data) - colMeans(sim_data))^2 )*5
SS_between_distribution[i] <- SS_between
}
```
Let's also create a distribution of $\text{SS}_\text{Between}$ that would happen if our first group had a two standard deviation shift in the mean.
```{r}
SS_between_distribution_alt <- c()
for(i in 1:1000){
sim_data <- matrix(rnorm(20,0,1),ncol=4,nrow=5)
sim_data[,1] <- sim_data[,1]+2 #add effect for group 1
SS_between <- sum( (mean(sim_data) - colMeans(sim_data))^2 )*5
SS_between_distribution_alt[i] <- SS_between
}
```
It's clear by plotting the distributions together that $\text{SS}_\text{Between}$ will have larger numbers as the variation between the means gets larger.
```{r}
SS_between_data <- data.frame(SS_between = c(SS_between_distribution,
SS_between_distribution_alt),
type = rep(c("Null","Alternative"), each=1000))
ggplot(SS_between_data, aes(x=SS_between, group=type, fill=type))+
geom_histogram(position="dodge")
```
### SS Within
$\text{SS}_\text{Within}$ sums the squared deviations between each score and its group mean. We can simulate a sampling distribution of $\text{SS}_\text{Within}$ that we would expect under the null.
```{r}
SS_Within_distribution <- c()
for(i in 1:1000){
sim_data <- matrix(rnorm(20,0,1),ncol=4,nrow=5)
SS_Within <- sum( (colMeans(sim_data) - t(sim_data))^2 )
SS_Within_distribution[i] <- SS_Within
}
```
Let's also create a distribution of $\text{SS}_\text{Within}$ that would happen if our first group had a two standard deviation shift in the mean.
```{r}
SS_Within_distribution_alt <- c()
for(i in 1:1000){
sim_data <- matrix(rnorm(20,0,1),ncol=4,nrow=5)
sim_data[,1] <- sim_data[,1]+2 #add effect for group 1
SS_Within <- sum( (colMeans(sim_data) - t(sim_data))^2)
SS_Within_distribution_alt[i] <- SS_Within
}
```
Remember, we have created an alternative distribution by violating the null with a shift in the mean for the first group. We did not assume the manipulation had any effect on the variances. Thus, $\text{SS}_\text{Within}$ should be the same when the null is true or false (for mean shifts).
```{r}
SS_Within_data <- data.frame(SS_Within = c(SS_Within_distribution,
SS_Within_distribution_alt),
type = rep(c("Null","Alternative"), each=1000))
ggplot(SS_Within_data, aes(x=SS_Within, group=type, fill=type))+
geom_histogram(position="dodge")
```
### MS Between
$\text{MS}_\text{Between}$ is a variance, and remember variances are average sums of squares. In our case, we are "estimating" the variance due to the effect of the manipulation, so we divide by the degrees of freedom $\text{SS}_\text{Between}/df_B$.
There are four group means, and we are estimating them with respect to the grand mean, so the number of means free to vary is 4 - 1 = 3.
We can plot the distributions of $\text{MS}_\text{Between}$ that we would expect when the null is true, and when it is false (using our alternative example). We can re-use the $\text{SS}_\text{Between}$ distributions and simply divide the values by $df_B$.
```{r}
MS_between_data <- data.frame(MS_between = c(SS_between_distribution/3,
SS_between_distribution_alt/3),
type = rep(c("Null","Alternative"), each=1000))
ggplot(MS_between_data, aes(x=MS_between, group=type, fill=type))+
geom_histogram(position="dodge")
```
### MS Within
$\text{MS}_\text{Within}$ represents the average variation within groups. ANOVA assumes manipulations **only** cause shifts in means, and from this perspective, $\text{MS}_\text{Within}$ represents variance that is not due to the manipulation.
We are again "estimating" the variance so we divide by the degrees of freedom $\text{SS}_\text{Within}/df_W$.
There are 20 subject means, and we are estimating them with respect to the four group means, so the number of means free to vary is 20 - 4 = 16.
We can plot distributions of $\text{MS}_\text{Within}$ that we would expect when the null is true, and when it is false (using our alternative example). Again, we can re-use the $\text{SS}_\text{Within}$ distributions and simply divide the values by $df_W$.
```{r}
MS_Within_data <- data.frame(MS_Within = c(SS_Within_distribution/16,
SS_Within_distribution_alt/16),
type = rep(c("Null","Alternative"), each=1000))
ggplot(MS_Within_data, aes(x=MS_Within, group=type, fill=type))+
geom_histogram(position="dodge")
```
### F
The last component of the ANOVA table is $F$, which is a ratio of the variances, $\text{MS}_\text{Between} / \text{MS}_\text{Within}$. Under the null, both mean squares estimate the same underlying variance, so $F$ should hover around 1; when a manipulation shifts the means, $\text{MS}_\text{Between}$ grows while $\text{MS}_\text{Within}$ stays the same, pushing $F$ above 1.
Let's re-simulate the F-distribution we would expect under the null.
```{r}
F_distribution <- c()
for(i in 1:1000){
sim_data <- matrix(rnorm(20,0,1),ncol=4,nrow=5)
SS_between <- sum( (mean(sim_data) - colMeans(sim_data))^2 )*5
SS_Within <- sum( (colMeans(sim_data) - t(sim_data))^2 )
sim_F <- (SS_between/3) / (SS_Within/16)
F_distribution[i] <- sim_F
}
```
And, re-simulate the F-distribution we would expect for our alternative (that group one has a mean shift of two standard deviations).
```{r}
F_distribution_alt <- c()
for(i in 1:1000){
sim_data <- matrix(rnorm(20,0,1),ncol=4,nrow=5)
sim_data[,1] <- sim_data[,1]+2 #add effect for group 1
SS_between <- sum( (mean(sim_data) - colMeans(sim_data))^2 )*5
SS_Within <- sum( (colMeans(sim_data) - t(sim_data))^2 )
sim_F <- (SS_between/3) / (SS_Within/16)
F_distribution_alt[i] <- sim_F
}
```
And, then plot the F-distributions.
```{r}
F_data <- data.frame(F = c(F_distribution,
F_distribution_alt),
type = rep(c("Null","Alternative"),
each=1000))
ggplot(F_data, aes(x=F, group=type, fill=type))+
geom_histogram(position="dodge")
```
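As an aside (not part of the original example), these two simulated distributions can also illustrate statistical power. The proportion of alternative-world F values that exceed the critical value from the null-world distribution estimates the probability of detecting this particular effect:
```{r}
# the alpha = .05 cutoff, estimated from the simulated null distribution
critical_F <- quantile(F_distribution, probs = 0.95)
# proportion of alternative-world experiments that would exceed the cutoff
mean(F_distribution_alt > critical_F)
```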
## Concept 2: Simulating the F-distribution
Where did the p-value come from? We have seen that the `aov` function does all of the work of computing the degrees of freedom, the sums of squares, the mean squares, and the F-value directly from the provided data frame. Remember that the F-value is essentially a descriptive statistic.
The `summary()` function showed that the F-value for our example data (7.227) is associated with a fairly small p-value (.00278). The p-value is looked up from the F-distribution, using the `pf()` function:
```{r}
?pf
pf(7.227,3,16, lower.tail = FALSE)
```
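Relatedly, `qf()` goes the other way: given a cumulative probability, it returns the corresponding F value. For example, the critical value that cuts off the top 5% of the F-distribution with 3 and 16 degrees of freedom:
```{r}
# critical F value for alpha = .05 with df = (3, 16)
qf(0.95, 3, 16)
```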
However, where did the F-distribution come from? The F-distribution is just another **sampling distribution**. It is a hypothetical distribution of the F values that would be obtained under the null hypothesis for this situation. We can construct our own F-distribution using Monte Carlo simulation. Let's see if we get close.
```{r}
# replace the example data with random numbers from a unit normal
romeo_juliet$Comprehension <- rnorm(20,0,1)
aov.out <- aov(Comprehension ~ Group, data = romeo_juliet)
simulated_F <- summary(aov.out)[[1]]$`F value`[1]
# do the above 10,000 times, pre-allocating a vector to store the F values
save_F_values <- numeric(10000)
for(i in 1:10000){
  romeo_juliet$Comprehension <- rnorm(20,0,1)
  aov.out <- aov(Comprehension ~ Group, data = romeo_juliet)
  save_F_values[i] <- summary(aov.out)[[1]]$`F value`[1]
}
# look at the simulated F distribution
hist(save_F_values)
# estimate the p-value associated with F = 7.22
length(save_F_values[save_F_values > 7.22])/length(save_F_values)
```
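As a check on the simulation, we can overlay the analytic F density (the `df()` function in base R) with 3 and 16 degrees of freedom on a density-scaled histogram; the simulated and theoretical distributions should agree closely:
```{r}
# density-scaled histogram of the simulated F values
hist(save_F_values, breaks = 50, freq = FALSE,
     main = "Simulated vs. theoretical F(3,16)", xlab = "F")
# overlay the theoretical F density
curve(df(x, 3, 16), from = 0, to = 10, add = TRUE, col = "red")
```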
## Practical 2: One-way ANOVA with `aov` function
We will use the `aov()` function many times throughout the remainder of this course to perform various ANOVAs. This is a very brief practical section to show you a few bells and whistles useful for conducting ANOVAs in R.
We will continue to use the same example data from the textbook:
```{r}
romeo_juliet <- tibble(subjects = 1:20,
Group = rep(c("No Context",
"Context Before",
"Context After",
"Partial Context"), each = 5),
Comprehension = c(3,3,2,4,3,
5,9,8,4,9,
2,4,5,4,1,
5,4,3,5,4
)
)
romeo_juliet$Group <- factor(romeo_juliet$Group,
levels = c("No Context",
"Context Before",
"Context After",
"Partial Context"))
```
### `aov()`
The `aov()` function has two main inputs, the formula (`Comprehension ~ Group`), and the data frame (`data = romeo_juliet`). The dataframe must be organized in long-format.
The column name of the dependent variable (Comprehension) is placed on the left side of the `~`, and the column name of the factor (independent variable) is placed on the right (Group). So the formula `Comprehension ~ Group` can be read as, "analyze the means of Comprehension as a function of the levels of Group".
```{r}
anova.out <- aov(Comprehension ~ Group, data = romeo_juliet)
```
The `aov()` function produces a list object containing the pieces of the ANOVA model. Printing this to the console usually does not give you the information you are looking for:
```{r}
anova.out
```
### `summary()`
The ANOVA table can be printed to the console using `summary()`. The `summary()` function takes an `aov()` model as its input, and returns the ANOVA table as an output.
```{r}
summary(anova.out)
```
You could also write this as:
```{r}
summary(aov(Comprehension ~ Group, data = romeo_juliet))
```
### `model.tables()`
Another helpful function is `model.tables()`, which extracts the means of each level in the analysis (from the `aov()` object), and prints them to the console.
```{r}
model.tables(anova.out)
```
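By default, `model.tables()` reports effects: deviations of each group mean from the grand mean. If you want the grand mean and the group means themselves, ask for them with the `type` argument:
```{r}
# type = "means" returns the grand mean and the means for each level
model.tables(anova.out, type = "means")
```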
### Extracting model terms
In practice, you might use `aov()`, `summary()`, and `model.tables()` to quickly print out an ANOVA and look at it. For example, if I wanted to run an ANOVA on the example data and look at it in R, I could do this:
```{r}
anova.out <- aov(Comprehension ~ Group, data = romeo_juliet)
summary(anova.out)
model.tables(anova.out)
```
Great, but what if I wanted to access specific values in the ANOVA table? You can extract these values by finding the correct indices into the list objects generated by `aov()` and `summary()`.
```{r}
my_summary <- summary(anova.out)
my_summary[[1]]$Df
my_summary[[1]]$`Sum Sq`
my_summary[[1]]$`Mean Sq`
my_summary[[1]]$`F value`
my_summary[[1]]$`Pr(>F)`
```
You could use the above strategy to help you insert specific values into your R markdown document.
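For example, a minimal sketch: save single values into named objects, and then reference those objects with inline code chunks in your write-up.
```{r}
# pull out single values for inline reporting in the text
F_value <- round(my_summary[[1]]$`F value`[1], 2)
p_value <- round(my_summary[[1]]$`Pr(>F)`[1], 3)
```
An inline chunk that prints `F_value` would then insert the number directly into a sentence.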
### Reporting with `papaja`
The `papaja` package has many useful functions that allow you to extract pieces of the ANOVA that you want to report in a document, such as in a results section.
Check out all of the reporting options you have by using `apa_print()`, and inputting an ANOVA object.
```{r}
library(papaja)
papaja::apa_print(anova.out)
papaja::apa_print(anova.out)$full_result$Group
```
To give an example, current APA style for reporting an F-test is something like the following: F(dfb, dfw) = F-value, MSE = MS-within, p = p-value, and perhaps a partial eta-squared value. All of this information is available in `papaja::apa_print(anova.out)$full_result$Group`, and if you insert it as an inline chunk, you get this `r papaja::apa_print(anova.out)$full_result$Group`.
You can also create tables that are formatted in APA style. In this case, I print the ANOVA table to the document in a slightly nicer way than the console print-out.
```{r}
papaja::apa_table(apa_print(anova.out)$table)
```
## Lab 4 Generalization Assignment
<iframe width="560" height="315" src="https://www.youtube.com/embed/Y04e14wWh78" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
### Instructions
Your assignment instructions are the following:
1. Work inside the new R project for stats II that you created
2. Create a new R Markdown document called "Lab4.Rmd"
3. Use Lab4.Rmd to show your work attempting to solve the following generalization problems. Commit your work regularly so that it appears on your Github repository.
4. **For each problem, make a note about how much of the problem you believe you can solve independently without help**. For example, if you needed to watch the help video and are unable to solve the problem on your own without copying the answers, then your note would be 0. If you are confident you can complete the problem from scratch completely on your own, your note would be 100. It is OK to have all 0s, all 100s, or anything in between.
5. Submit your github repository link for Lab 4 on blackboard.
### Problems
1. Consider the following example data for a between-subjects experiment with two groups, A and B:
```{r}
example_data <- tibble(Group = rep(c("A","B"), each = 5),
DV = c(2,4,3,5,4,7,6,5,6,7))
```
Use R to conduct a t.test and ANOVA on this data. Then use R to prove that the results of both analyses are the same. For example, prove that the p-values are the same, and prove that the F-value and T-value are related. (3 points)
2. Look at the lab on ANOVA that I wrote for our undergraduate statistics OER lab manual <https://crumplab.github.io/statisticsLab/lab-8-one-way-anova.html>. That lab shows an example of obtaining data from a published paper in Psychological Science where a one-factor ANOVA was used as part of the analysis. Load the data, conduct the ANOVA, report a ggplot of the means, and use papaja to help you write a short results section reporting the ANOVA result. (3 points)