---
title: "Regression, OLS, Model Quality, Parameter Significance, Variable Importance"
output:
word_document: default
html_document: default
date: ""
editor_options:
markdown:
wrap: 72
---
# 1. Regression
In many cases, researchers and business analysts are interested in
determining the relationship between variables, e.g. advertisement
expenditures and sales.
We always have one variable that we would like to explain, called Y or
the dependent variable, and a set of variables that should explain Y,
which we call X or the explanatory (independent) variables.
A regression analysis will always yield a line that describes the
relationship between the explanatory variables and the dependent
variable in the best possible way (in the case of one explanatory
variable): $Y = \alpha + \beta \cdot X$
However, since we work with observational data, there are also sources
of error that prevent our line from fitting the data perfectly, such as
measurement error, non-linear relationships, or omitted variables.
We will work with linear regressions only. Hence, we assume that the
relationship between the explanatory variables and the dependent
variable is linear! This assumption will be somewhat relaxed later on.
So how do we find the line fitting our data best?
# 2. Ordinary least squares (OLS)
Ordinary least squares (OLS) generates a line such that the sum of the
squared errors of our prediction of the dependent variable (Y) given the
explanatory variable (X) in our sample is minimized.
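Formally, in the single-regressor case this means choosing the estimates
$\hat{\alpha}$ and $\hat{\beta}$ such that
$\sum_{i=1}^{n}(y_i - \hat{\alpha} - \hat{\beta} x_i)^2$
is minimized over the $n$ observations in the sample.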
Under certain assumptions, the OLS estimator is BLUE, i.e. the best,
linear, unbiased estimator.
Another advantage is the easy interpretation of the results: What
happens to Y if X increases by one unit, HOLDING ALL OTHER VARIABLES
CONSTANT? This is basically given by the respective coefficient estimate
of X.
# 3. Quality of our model
The most famous "quality" measure is the R^2^. It tells us what
share/percentage of the variation in Y is explained by all X jointly.
It ranges from 0 to 1. $R^2 = ESS/TSS$,
where $ESS$ is the explained sum of squares and $TSS$ is the total sum
of squares.
If none of our X has explanatory power to predict Y, R^2^ will be equal
to zero. If our explanatory variables are able to perfectly predict Y,
R^2^ will be equal to one.
An issue with R^2^ is that it increases when adding more variables to
the model, even if the new X have little or no explanatory power. The
estimator will always be able to establish some relationship between X
and Y.
Solution: the adjusted R-squared, which corrects for the number of
parameters included in the model (see the formula below).
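For reference, one common form of this correction is
$\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k}$,
where $n$ is the number of observations and $k$ is the number of
estimated parameters (including the intercept).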
# 4. Significance of parameters
So far, we have only looked at the overall quality of our model since
R^2^ tells us what share of the variation in Y is explained by all X
together. However, we would also like to know whether the relationship
between each individual X and Y is significantly different from zero.
We can use a t-test for this again. It is now calculated by:
$t = (\hat{\beta}-0)/se(\hat{\beta})$
Moreover, we can test the null hypothesis that ALL parameters are
JOINTLY equal to zero. This can be done using the F-test.
$F = \frac{ESS/(k-1)}{RSS/(n-k)}$
where $k$ is the number of parameters to be estimated and $n$ is the
number of observations.
# 5. Importance of variables
The size of a coefficient estimate does not tell us much about the
importance of an explanatory variable to explain our dependent variable
since we measure the variables in different units.
Solution: Standardized coefficients.
$\hat{\beta}_{stand} = \hat{\beta} \cdot sd(X)/sd(Y)$ where
$\hat{\beta}_{stand}$ is the standardized coefficient estimate of
$\hat{\beta}$,
$sd(X)$ is the standard deviation of X, and
$sd(Y)$ is the standard deviation of Y.
The standardized coefficient tells us by how many standard deviations Y
changes when X increases by one standard deviation.
# Theoretical Example
Let's start with an illustration of OLS in a quite general way by
evaluating a rather artificial example.
```{r message=FALSE, warning=FALSE}
n = 200
# simulate a linear relationship with fairly large noise
xone = seq_len(n)
yone = 1.5*xone + rnorm(n = n, mean = 3, sd = 25)
plot(xone, yone)
olsone = lm(yone ~ xone)
summary(olsone)
# scatter plot with the fitted regression line
plot(xone, yone)
abline(olsone)
```
What happens if we decrease the noise?
```{r}
n = 200
xtwo = seq_len(n)
# same relationship, but with much smaller noise
ytwo = 1.5*xtwo + rnorm(n = n, mean = 3, sd = 5)
plot(xtwo, ytwo)
olstwo = lm(ytwo ~ xtwo)
summary(olstwo)
# scatter plot with the fitted regression line
plot(xtwo, ytwo)
abline(olstwo)
```
What happens if the relationship is not linear?
```{r}
xthree = seq_len(n)
# non-linear (logarithmic) relationship between X and Y
ythree = log(xthree) + rnorm(n = n, mean = 3, sd = 0.5)
plot(xthree, ythree)
olsthree = lm(ythree ~ xthree)
summary(olsthree)
# the straight regression line cannot capture the curvature
plot(xthree, ythree)
abline(olsthree)
```
We can solve this by transforming our X variable.
```{r}
# transform X so that the relationship becomes linear in log(X)
logxthree = log(xthree)
plot(logxthree, ythree)
olsfour = lm(ythree ~ logxthree)
summary(olsfour)
# scatter plot with the fitted regression line
plot(logxthree, ythree)
abline(olsfour)
```
# Real Example
Let's use real data: the mpg data set (publicly available) is shipped
with ggplot2.
The variables contained in the dataset are:
- manufacturer = manufacturer of the car
- model = model
- displ = engine displacement in liters
- year = year of manufacturing
- cyl = number of cylinders
- trans = type of transmission (automatic/manual)
- drv = drive type (front, rear, 4-wheel)
- cty = city mileage in miles per gallon
- hwy = highway mileage in miles per gallon
- fl = fuel type
- class = vehicle class (SUV etc.)
```{r}
library(ggplot2)
dat = mpg
# regress highway mileage on engine displacement and year of manufacturing
hmile = lm(hwy ~ displ + year, data = dat)
summary(hmile)
```
The $TSS$, $RSS$ and $ESS$ for our model are
```{r}
# total sum of squares
TSS = var(dat$hwy)*(nrow(dat)-1)
# residual sum of squares, computed from the squared residuals
dat$errsq = (hmile$residuals)^2
RSS = sum(dat$errsq)
# explained sum of squares
ESS = TSS - RSS
TSS
RSS
ESS
```
The $R^{2}$ can then be calculated manually as
```{r}
R_squared = ESS/TSS
R_squared
```
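Analogously, the adjusted R-squared discussed in section 3 can be
reproduced by hand; a minimal sketch (with $k$ counted including the
intercept), which should match the adjusted R-squared reported by
`summary(hmile)`:
```{r}
k = length(coefficients(hmile))  # number of estimated parameters (incl. intercept)
n = nrow(dat)                    # number of observations
R_squared_adj = 1 - (1 - R_squared)*(n - 1)/(n - k)
R_squared_adj
```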
The $F$-statistic from section 4 can be calculated manually as
```{r}
fishstat = (ESS/(length(coefficients(hmile))-1))/
(RSS/(nrow(dat)-length(coefficients(hmile))))
fishstat
```
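The t-statistics from section 4 can likewise be reproduced by hand from
the coefficient estimates and their standard errors; a minimal sketch
using the summary of the fitted model:
```{r}
# coefficient estimates and standard errors from the fitted model
est = coef(summary(hmile))[, "Estimate"]
se = coef(summary(hmile))[, "Std. Error"]
# t-statistic for the null hypothesis that the respective coefficient is zero
est/se
```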
Of course, the above information, together with additional information
concerning the coefficients, their standard errors and the corresponding
$p$-values, is already included in the summary output above. We observe:
- The value of $R^{2} \approx 0.6$ is quite large, so the model explains
  about 60% of the variation of $Y = hwy$.
- The coefficient of $displ$ is **significant** (not zero) at any common
  level of significance (its $p$-value is practically zero), whereas the
  coefficient of $year$ is **significant** (not zero) at the 1%
  significance level (as $p$-value $= 0.005 < 0.01$).
- As already expected, the two variables are also jointly significant at
  any common level of significance (as $F$-statistic $= 173.5$ with a
  $p$-value that is practically zero).
However, the fact that the coefficients of the respective variables are
significant does not imply that the respective variable is also
important! To examine this, we should first standardize **the
coefficients**, as sketched below.
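A minimal sketch of how the standardized coefficients can be computed
from the fitted model, using the formula from section 5; this should
reproduce (up to rounding) the values discussed below:
```{r}
# standardized coefficients: beta_hat * sd(X) / sd(Y)
b = coef(hmile)
b["displ"]*sd(dat$displ)/sd(dat$hwy)
b["year"]*sd(dat$year)/sd(dat$hwy)
```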
The standardized coefficient of $year$ changes from 0.156 to 0.118,
whereas that of $displ$ shrinks (in absolute size) from -3.611 to
-0.783. We observe that the **variable** $displ$ **is not as important**
as it would naively seem without standardizing the coefficients.