-
Notifications
You must be signed in to change notification settings - Fork 0
/
wpa8_solutions.Rmd
240 lines (172 loc) · 9.64 KB
/
wpa8_solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
---
title: "Solutions to WPA8"
---
## 3. Now it's your turn
## Student Performance
In this WPA, you will analyze data from a study on student performance in two classes: math and Portugese. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#
**A1. Download the data from the website (by clicking on Data Folder, and unzipping the downloaded student folder on your computer). We are going to use one of the files contained in the student folder: student-mat.csv. Load it in R as student_math. Inspect the dataset first.**
```{r}
library(tidyverse)
student_math = read_csv2('data/student-mat.csv')
glimpse(student_math)
head(student_math)
```
**A2. Create a regression object called model_fit_math_1 predicting first period grade (G1) based on age. How do you interpret the relationship between age and first period grade? Respond in terms of both model parameters fit and overall model fit. Were the model assumption violated? Respond using the plots we have seen in class. Finally, make a scatterplot with the regression line to illustrate such relationship as we have seen in previous assignments.**
```{r}
model_fit_math_1 = lm(G1 ~ age, data = student_math)
summary(model_fit_math_1)
# looks like there is no relationship with age
# the age coefficient is not significant, as well as the F-test, meaning that the intercept model is better than the intercept+age model
```
```{r}
ggplot(data = student_math, mapping = aes(x = age, y = G1)) +
geom_point(alpha = 0.8, size= 1) +
geom_smooth(method = lm, color='red')
```
```{r}
# Obtain standardised residuals:
res_stu = rstudent(model = model_fit_math_1)
# Obtain model's predictions:
pred = model_fit_math_1$fitted.values
model_checks = data.frame(pred = pred, res_stu = res_stu)
model_checks = as_tibble(model_checks)
```
```{r}
ggplot(data = model_checks, mapping = aes(x = res_stu)) +
geom_histogram(aes(y=..density..), binwidth=.1, colour="darkgrey", fill="white") + # Note: add aes(y=..density..) to have density instead of frequencies
labs(x = 'Studentized residuals', y='Density') +
geom_density(alpha=.2, fill="red", colour="darkgrey") + # Overlay with transparent density plot
geom_vline(aes(xintercept=mean(res_stu)), color="red", linetype="dashed", size=.5) # Add mean residuals
ggplot(model_checks, mapping = aes(sample = res_stu)) +
stat_qq() +
stat_qq_line()
ggplot(data = model_checks, mapping = aes(x = pred, y = res_stu)) +
geom_point(alpha = 0.6, size= 2) +
geom_hline(yintercept=0)
```
You can also use a more formal test for normality:
```{r}
shapiro.test(model_checks$res_stu)
```
This test shows that we can reject the null hypothesis that the residuals came from a normal distribution.
See [here](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) for more info.
And you can also use a more formal test for heteroskedacity:
```{r}
#install.packages('lmtest')
#install.packages('zoo')
library(lmtest)
```
```{r}
bptest(model_fit_math_1)
```
This test shows that we cannot reject the null that the variance of the residuals is constant, thus heteroskedacity is not present.
See [here](https://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test) for more info.
**A3. Create a regression object called model_fit_math_2 predicting first period grade (G1) based on absences. How do you interpret the relationship between absences and G1? Respond in terms of both model parameters fit and overall model fit. Were the model assumption violated? Respond using the plots we have seen in class. Finally, make a scatterplot with the regression line to illustrate such relationship as we have seen in previous assignments.**
```{r}
model_fit_math_2 = lm(G1 ~ absences, data = student_math)
summary(model_fit_math_2)
# looks like there is no relationship with absences
# the absences coefficient is not significant, as well as the F-test, meaning that the intercept model is better than the intercept+absences model
```
```{r}
ggplot(data = student_math, mapping = aes(x = absences, y = G1)) +
geom_point(alpha = 0.8, size= 1) +
geom_smooth(method = lm, color='red')
```
```{r}
# Obtain standardised residuals:
res_stu = rstudent(model = model_fit_math_2)
# Obtain model's predictions:
pred = model_fit_math_2$fitted.values
model_checks = data.frame(pred = pred, res_stu = res_stu)
model_checks = as_tibble(model_checks)
```
```{r}
ggplot(data = model_checks, mapping = aes(x = res_stu)) +
geom_histogram(aes(y=..density..), binwidth=.1, colour="darkgrey", fill="white") + # Note: add aes(y=..density..) to have density instead of frequencies
labs(x = 'Studentized residuals', y='Density') +
geom_density(alpha=.2, fill="red", colour="darkgrey") + # Overlay with transparent density plot
geom_vline(aes(xintercept=mean(res_stu)), color="red", linetype="dashed", size=.5) # Add mean residuals
ggplot(model_checks, mapping = aes(sample = res_stu)) +
stat_qq() +
stat_qq_line()
ggplot(data = model_checks, mapping = aes(x = pred, y = res_stu)) +
geom_point(alpha = 0.6, size= 2) +
geom_hline(yintercept=0)
```
```{r}
shapiro.test(model_checks$res_stu)
bptest(model_fit_math_2)
```
**A4. Create a regression object called model_fit_math_3 predicting first period grade (G1) based on school support. How do you interpret the relationship between school support and G1? Respond in terms of both model parameters fit and overall model fit. Were the model assumption violated? Respond using the plots we have seen in class. Finally, make a scatterplot with the regression line to illustrate such relationship as we have seen in previous assignments.**
```{r}
model_fit_math_3 = lm(G1 ~ schoolsup, data = student_math)
summary(model_fit_math_3)
# looks like there is a relationship with absences
# the absences coefficient is significant, as well as the F-test, meaning that the schoolsup+intercept model is better than the intercept model
```
```{r}
student_math = mutate(student_math,
schoolsup_bin = recode(schoolsup, "yes" = 1, "no" = 0))
```
```{r}
distinct(student_math, schoolsup, schoolsup_bin)
```
```{r}
# To make the plot "work" I created the dummy variable schoolsup_bin
ggplot(data = student_math, mapping = aes(x = schoolsup_bin, y = G1)) +
geom_point(alpha = 0.8, size= 1) +
geom_smooth(method = lm, color='red')
```
```{r}
# Note that the result is the same using the dummy variable:
model_fit_math_3 = lm(G1 ~ schoolsup_bin, data = student_math)
summary(model_fit_math_3)
```
```{r}
# Obtain standardised residuals:
res_stu = rstudent(model = model_fit_math_3)
# Obtain model's predictions:
pred = model_fit_math_3$fitted.values
model_checks = data.frame(pred = pred, res_stu = res_stu)
model_checks = as_tibble(model_checks)
```
```{r}
ggplot(data = model_checks, mapping = aes(x = res_stu)) +
geom_histogram(aes(y=..density..), binwidth=.1, colour="darkgrey", fill="white") + # Note: add aes(y=..density..) to have density instead of frequencies
labs(x = 'Studentized residuals', y='Density') +
geom_density(alpha=.2, fill="red", colour="darkgrey") + # Overlay with transparent density plot
geom_vline(aes(xintercept=mean(res_stu)), color="red", linetype="dashed", size=.5) # Add mean residuals
ggplot(model_checks, mapping = aes(sample = res_stu)) +
stat_qq() +
stat_qq_line()
ggplot(data = model_checks, mapping = aes(x = pred, y = res_stu)) +
geom_point(alpha = 0.6, size= 2) +
geom_hline(yintercept=0)
```
```{r}
shapiro.test(model_checks$res_stu)
bptest(model_fit_math_3)
```
They are both significant: both assumptions have been violated.
**A5. Given that school support is a nominal variable with 2 levels, how can you tell from the output which direction the effect is? How does this relate to the way the dataset has stored the levels of the school support factor?**
There is a negative relationship between school support and first period grade (b = -2.1): Students with extra support from the school have worse Period 1 grades on average.
In the regression, rather than a predictor labelled schoolsup because it is a nominal variable we are given the variable schoolsupyes.
This tells us that the regression coefficient is the change in grade caused by going from no school support to yes school support.
In essence, R has dummy coded the variable so that students with no school support are given a value of 0 and those with yes school support a value of 1.
If we look at the factor levels for the school support column we see that no is level 1 and yes level two. 'lm()' will always code predictors as the change
from level 1 to level 2 (i.e. level 1 is coded as 0, and level 2 as 1).")
```{r}
levels(factor(student_math$schoolsup))
```
**A6. From the regression, what would be your best guess for the first period grade for a student with no school support? What about for a student with school support?**
The intercept estimate tells us the best estimate for first period grade when all predictor vairables have a value of 0.
In this case, we only have 1 predictor (school support) and it is dummy coded so that those without school support have a value of 0.
Therefore, the predicted first period grade for those without school support is 11.18 (the intercept estimate).
We can also use our schoolsupportyes estimate to calculate our prediction for those with school support: Our estimate tells us that those on school support are predicted to have a first period grade that is 2.1 lower than those without school support,
therefore they have a predicted grade of 9.08.
Because we have a single categorical predictor, these predicted scores should also be the mean of each of the groups:
```{r}
summarize(group_by(student_math, schoolsup),
mean_G1 = mean(G1))
```