# Chapter 2 - Regression and model validation
*Describe the work you have done this week and summarize your learning.*
- Describe your work and results clearly.
- Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
- Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
```{r}
date()
```
*Reading the data frame*
```{r}
library(ggplot2)
library(GGally)
l2014 <- read.table("data/learning_2014_out.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
```
*Looking at the data structure*
```{r}
str(l2014)
```
```{r}
head(l2014)
```
The data consists of 7 variables and 166 observations.
It includes students' scores in three learning-approach categories: deep learning ("deep"), strategic learning ("stra") and surface learning ("surf"), as well as exam points ("points") and attitude ("attitude") in their own variables. The "deep" and "surf" scores are each the average of 12 questions from their respective subjects, while "stra" is the average of 8 strategic questions.
The dataset also contains students' gender and age in their respective variables.
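The averaging itself was done in a separate data-wrangling script; a minimal sketch of that step, assuming hypothetical question column names in the raw survey data, could look like this:
```{r, eval=FALSE}
# Sketch only (not run): compute the combination scores as row means of
# the related survey questions; the column names here are hypothetical
deep_questions <- paste0("D", 1:12)   # 12 deep learning questions
surf_questions <- paste0("SU", 1:12)  # 12 surface learning questions
stra_questions <- paste0("ST", 1:8)   # 8 strategic learning questions

raw_data$deep <- rowMeans(raw_data[, deep_questions])
raw_data$surf <- rowMeans(raw_data[, surf_questions])
raw_data$stra <- rowMeans(raw_data[, stra_questions])
```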
```{r}
summary(l2014)
par(mfrow = c(2,3))
hist(l2014$age)
hist(l2014$attitude)
hist(l2014$surf)
hist(l2014$deep)
hist(l2014$stra)
hist(l2014$points)
```
Apart from age, the variables are fairly normally distributed; the "stra" variable is also somewhat "middle-heavy", with most values concentrated around the center of its range.
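As a numeric complement to the visual impression, one could also run a Shapiro-Wilk test on each continuous variable (small p-values would suggest deviation from normality):
```{r}
# Shapiro-Wilk normality test p-values for each continuous variable
sapply(l2014[, c("age", "attitude", "deep", "stra", "surf", "points")],
       function(x) shapiro.test(x)$p.value)
```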
## Exploring variables and relationships
```{r}
(p <- ggpairs(l2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20))))
```
From the graphical summary we can see that there were significantly more women than men. All in all, the variables were fairly similar between the genders; notably, males had a slightly higher attitude score than females, whereas females had a slightly higher surface score than males.
Exam points correlate most strongly with attitude, are somewhat correlated with the strategic questions, and are negatively correlated with the surface questions.
Age did not correlate significantly with exam points or the other variables.
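The same relationships can be quantified with a plain correlation matrix:
```{r}
# Pairwise correlations between the numeric variables, rounded for readability
round(cor(l2014[, c("age", "attitude", "deep", "stra", "surf", "points")]), 2)
```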
## Fitting a regression model
```{r}
# create a regression model with multiple explanatory variables
my_model <- lm(points ~ deep + stra + attitude, data = l2014)
# print out a summary of the model
summary(my_model)
```
The model's multiple R-squared is 0.2097, meaning that about 21% of the variance in points can be explained by attitude, "stra" and "deep" together.
The model's residual standard error is 5.289 on 162 degrees of freedom.
We can see that "stra" and "deep" do not have a statistically significant relationship with points (p-value > 0.05). The only significant explanatory variable, "attitude", has a positive relationship with points (estimate: 3.5254) and is highly significant with a p-value of 4.44e-09. The model's intercept is also highly significantly different from zero.
Therefore we drop these two variables and fit a linear model with points as the target variable and attitude as the only explanatory variable.
```{r}
# fit a linear model with points as target variable and attitude as explanatory variable
my_model2 <- lm(points ~ attitude, data = l2014)
# print out a summary of the model
summary(my_model2)
```
Here we see that attitude still has a significant relationship with points (p-value = 4.12e-09) when it is the only explanatory variable. The model's multiple R-squared decreased only slightly, from 0.2097 to 0.1906, indicating that dropping "deep" and "stra" had little effect on the model's performance.
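Whether dropping "deep" and "stra" loses any real explanatory power can also be checked with an F-test comparing the two nested models fitted above:
```{r}
# F-test for nested models: a large p-value means the dropped variables
# do not add significant explanatory power over attitude alone
anova(my_model2, my_model)
```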
## Model validation
```{r}
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
plot(my_model2, which = c(1,2,5))
```
The residuals vs. fitted values plot shows no clear pattern in the residuals across the fitted values, supporting the constant-variance assumption.
The Q-Q plot shows that the normality assumption holds fairly well, as most of the residuals lie near the normality line.
In the residuals vs. leverage plot we can see that some points have somewhat high leverage (x-axis values), but even these are within Cook's distance, so the model shouldn't be affected too much by them.
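To back the leverage reading with numbers, one could also inspect the largest Cook's distances directly; values near or above 1 would flag overly influential observations:
```{r}
# Six largest Cook's distances for the final model
head(sort(cooks.distance(my_model2), decreasing = TRUE))
```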