---
title: "210B Lab Report 1"
author: "Anagha Uppal"
date: "2/19/2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### PART 1:
#### Background
For an introduction to multiple linear regression, we have access to a dataset called "SmallHHFile". This file contains a variety of demographic and travel-related numerical, categorical, dichotomous, and nominal information on households in an area: household income; the number of people in the household; the number of people who are employed, are students, or hold driver's licenses; the number of cars, new cars, and bikes; the number of trips taken and miles traveled; whether the house is located in an urban, suburban, or rural area; the type of home; and more.
We use this data to estimate the travel habits of these households. For the first model, for instance, we want to know what affects the number of miles traveled daily per person per household. For the second, we are interested in a similar metric: the number of trips per person or per household. It seems likely that trips and miles would covary and would depend on things like the number of cars the household has, who lives there (kids who can't drive or don't need to travel versus adults who do), and especially the rurality of the household: rural households would probably need to travel farther into town to meet their needs.
For the first part of this lab, we simply load and summarize this data file in two separate ways. We then move on to creating the models in Parts 2 and 3.
*****
```{r, message=FALSE}
## Setting up: install any missing packages, then load all libraries
setwd("~/Desktop/Anagha/UCSB/Classes/GEOG 210B")
pkgs <- c('readr', 'psych', 'regclass', 'stargazer', 'lmtest', 'car', 'leaps')
for (pkg in pkgs) {
  if (require(pkg, character.only = TRUE) == FALSE) {
    install.packages(pkg, repos = "https://cran.rstudio.com/", quiet = TRUE)
    require(pkg, character.only = TRUE)
  }
}
```
```{r data_import, message=FALSE}
## Importing data and initial description
SmallHHfile <- read_csv("SmallHHfile.csv")
```
```{r}
summary(SmallHHfile)
describe(SmallHHfile) #different way to describe the data available
```
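As a quick check of the covariation speculated in the Background (my own aside, not a lab requirement), the two per-person travel measures can be correlated directly:
```{r trip_mile_cor, eval=FALSE}
# If trips and miles covary strongly, models of either should behave similarly.
cor(SmallHHfile$TrpPrs, SmallHHfile$MilesPr)
```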
*********
### PART 2:
##### Background
Our first linear model attempts to estimate MilesPr using the explanatory variables specified in the lab. These are the dummy indicators Monday through Saturday representing the day traveled (Sunday is the omitted reference day), along with the number of cars in the household, the number of persons in the household, and the location of the residence. Other than the travel day, these are exactly the variables I most expect to be significant in explaining the variation in y.
##### Model Overview
We will consider the following tests and diagnostics to understand the model: first, we visually check the validity of the data, including normality, linearity, and equal spread, using plots and residuals. We then consider the ANOVA table for the model, which is an appropriate diagnostic given that many of the explanatory variables are dichotomous. We also examine the model summary, the coefficients' confidence intervals, the Durbin-Watson test for serial autocorrelation, White's test for heteroskedasticity, VIF, and AIC.
A summary of the results of each test appears under the test itself.
```{r lm}
# Creating the linear regression model
linearMod <- lm(MilesPr ~ Mon + Tue + Wed + Thu + Fri + Sat + HHVEH + HHSIZ + suburb + exurb + rural, data = SmallHHfile)
```
Equation:
$$\widehat{\text{MilesPr}} = 19.077 - 0.680\,\text{Mon} + 0.715\,\text{Tue} + 0.959\,\text{Wed} + 0.666\,\text{Thu} + 3.786\,\text{Fri} + 3.569\,\text{Sat} + 4.894\,\text{HHVEH} - 2.360\,\text{HHSIZ} + 3.118\,\text{suburb} + 6.251\,\text{exurb} + 6.911\,\text{rural}$$
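The coefficients above were transcribed by hand; as a small convenience sketch (my addition, not part of the lab), they can be pulled directly from the fitted object:
```{r coef_extract, eval=FALSE}
# Extract and round the fitted coefficients instead of retyping them.
round(coef(linearMod), 4)
```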
```{r residuals}
##check_regression(linearMod) # errors here rather than switching tests; see note below
# Initial numerical check of the model residuals
residuals <- resid(linearMod)
summary(residuals)
#ks.test(linearMod) # ks.test() needs a numeric vector, not an lm object; see the sketch below
```
I have no idea why my check_regression() call fails instead of automatically switching from the Shapiro-Wilk test to the Kolmogorov-Smirnov test when the sample size exceeds 5,000, so I have to produce these tests individually. The first of these is an examination of the residuals-vs-fitted plot and the QQ-plot for normality.
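As a workaround, here is a minimal sketch (my own substitute, not the lab's prescribed call) that runs the Kolmogorov-Smirnov normality check directly on the standardized residuals:
```{r ks_manual, eval=FALSE}
# Compare the standardized residuals against a standard normal distribution.
ks.test(as.numeric(scale(resid(linearMod))), "pnorm")
```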
We see that the mean of the residuals is 0, as must hold for any OLS fit with an intercept.
```{r residplot}
plot(fitted(linearMod), resid(linearMod), xlab = "Fitted values", ylab = "Residuals")
```
There is a visible trend in the residual plot, suggesting that the linear specification is missing some structure in the data. The residuals are also clearly heteroscedastic.
```{r plotting}
plot(linearMod)
```
We see from the QQ-plot that the residuals are definitely not normal; in fact, they are highly skewed.
****
Since the predictor variables are mostly categorical, including several dichotomous dummies, the ANOVA table is a more natural way to read the model than the raw coefficient table.
```{r anova}
anova(linearMod)
```
The difficulty with ANOVA is that it cannot speak to the value of the model as a whole; it only assesses individual terms. Of these, Monday and Saturday, along with HHVEH, HHSIZ, and the exurb/rural locations, are significant. It seems strange that suburb, for instance, does not also show significance alongside the other dummy variables for that category. (One caveat: R's anova() uses sequential Type I sums of squares, so each term's result depends on the order in which the terms enter the model.)
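A minimal sketch of that order dependence (the reordered formula is my own illustration, not part of the lab):
```{r anova_order, eval=FALSE}
# With the location dummies entered first, their sequential sums of
# squares (and p-values) can change relative to the original ordering.
anova(lm(MilesPr ~ suburb + exurb + rural + HHVEH + HHSIZ +
           Mon + Tue + Wed + Thu + Fri + Sat, data = SmallHHfile))
```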
```{r}
summary(linearMod)
```
There are three components to pay attention to here: the F-test, the R-squared values and the individual coefficient p-values.
In this model, the F test statistic is quite high - 74.03 - and highly significant, with an extremely low p-value. This tells us that our model is a significantly better fit to the data than a model without any predictor variables.
However, the share of variability in the dependent variable (miles traveled per person per day) explained by the x variables we have chosen is not high: only about 19% of the variance is accounted for.
Of that explained variability, Friday and Saturday, along with HHVEH, HHSIZ, and the location of the household (suburb, exurb, or rural), contribute most significantly.
```{r}
possible_regressions(linearMod)
```
Both the scatterplot with the fitted regression line and the reduction-in-SSE plot indicate that the fitted line is unlikely to have occurred by chance; it is very unlikely that no relationship exists between the variables.
```{r confint}
confint(linearMod, level = 0.95)
```
Zero falls within the confidence intervals for Monday, Tuesday, Wednesday, and Thursday; by this diagnostic, every other variable is significant. These results largely agree with the other summaries and tests.
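A small sketch (my own convenience code) for flagging such intervals programmatically rather than by eye:
```{r ci_flag, eval=FALSE}
# List the coefficients whose 95% confidence interval straddles zero.
ci <- confint(linearMod, level = 0.95)
rownames(ci)[ci[, 1] < 0 & ci[, 2] > 0]
```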
```{r durbin-watson}
dwtest(linearMod)
```
The Durbin-Watson test indicates that the data is serially autocorrelated, i.e., there is a time-ordering effect in the residuals; the p-value for the test is 0.003.
```{r heteroskedasticity}
bptest(linearMod)
coeftest(linearMod,vcov=hccm(linearMod))
```
White's test is most useful when the heteroskedasticity is non-linear and the error terms are not normally distributed. In this case, heteroskedasticity is present (high BP statistic; low p-value), so we should use a heteroskedasticity-consistent covariance matrix. That is what the coeftest() call does: it recomputes every coefficient's standard error using the robust covariance matrix from hccm(), which defaults to the HC3 estimator, so the corrected t-statistics and p-values should be read from its output rather than from the naive summary().
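Spelling out that default explicitly (a sketch only; hccm() already uses "hc3" when no type is given):
```{r hc3_explicit, eval=FALSE}
# Robust (HC3) standard errors, with the estimator named explicitly.
coeftest(linearMod, vcov = hccm(linearMod, type = "hc3"))
```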
```{r collinear}
VIF(linearMod)
#generalization_error()
```
VIF values for the variables aren't very high, indicating a lack of collinearity between the predictor variables.
```{r aic}
#ALL <- regsubsets(MilesPr~.^2,data=SmallHHfile,method="exhaustive",nvmax=15,nbest=4, really.big = T)
#see_models(ALL,aicc=TRUE,report=5)
extractAIC(linearMod)
```
The AIC is quite high; since a lower AIC indicates a better trade-off between fit and complexity, this suggests there is room for improvement. Unfortunately, the functions to compare candidate models and choose the one with the lowest AIC aren't executing properly, so I can't establish whether there are better models to be made (without a lot of manual labor).
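As an alternative, here is a sketch using base R's step(), my own substitute for the see_models() comparison that wouldn't run; it searches for the model minimizing AIC:
```{r step_sketch, eval=FALSE}
# Stepwise search over the current terms, minimizing AIC.
LMstep1 <- step(linearMod, direction = "both", trace = FALSE)
summary(LMstep1)
```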
***********
### PART 3:
##### Background
Our second linear model continues to estimate travel behavior using the SmallHHFile data. In this iteration, all the variables are open to us to experiment with, and we must choose a dependent variable: either TrpPrs (number of trips per person) or HTRIPS (number of trips per household). However, I personally speculated that the indicators used in the first model were the most likely to explain the variability in y, so I don't expect this model to fare much better with more variables (in fact, it may fare worse, particularly on adjusted R-squared).
##### Model Overview
We will consider the same tests and diagnostics as in Part 2 to understand the model: a visual check of validity (normality, linearity, and equal spread) using plots and residuals; the ANOVA table, an appropriate diagnostic given that many of the explanatory variables are dichotomous; and the model summary, the coefficients' confidence intervals, the Durbin-Watson test for serial autocorrelation, White's test for heteroskedasticity, VIF, and AIC. As before, a summary of the results of each test appears under the test itself.
To create the models, I will remove whichever dependent-variable candidate is not being used, rather than leaving it as an explanatory variable, because TrpPrs and HTRIPS would probably covary. I will also remove SAMPN, the unique household identifier.
```{r}
linearMod2 <- lm(TrpPrs ~ . - HTRIPS - SAMPN, data = SmallHHfile)
summary(linearMod2)
linearMod3 <- lm(HTRIPS ~ . - TrpPrs - SAMPN, data = SmallHHfile)
summary(linearMod3)
stargazer(linearMod, linearMod2, linearMod3, type = "text", title = "Regression Results")
```
I thought the stargazer comparison chart might be useful for choosing between the two candidate dependent variables. Although the significant coefficients don't seem to vary greatly between the two models, HTRIPS has about double the R-squared of TrpPrs, so we choose HTRIPS as the dependent variable. I will use linearMod3, where y is HTRIPS (the number of trips per household per day), for the rest of the tests.
After looking at the summary of linearMod3, I see NAs in the coefficient and standard-error columns for a few variables (Sat, Sun, other), indicating singularity/perfect multicollinearity: the day-of-week dummies, for example, are mutually exclusive and exhaustive, so at least one is a linear combination of the intercept and the others and lm() drops it. I will remove those variables and re-create the model; a sketch for listing the dropped terms appears below.
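A short sketch (my addition, not required by the lab) for identifying which coefficients lm() dropped and why:
```{r alias_check, eval=FALSE}
# Coefficients estimated as NA were dropped due to linear dependence.
names(coef(linearMod3))[is.na(coef(linearMod3))]
alias(linearMod3)  # shows the exact linear dependencies among the terms
```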
```{r}
linearMod3 <- lm(HTRIPS ~ . - TrpPrs - SAMPN - Sat - Sun - other, data = SmallHHfile)
```
```{r residuals2}
##check_regression(linearMod3) # same failure as in Part 2
# Initial numerical check of the new model's residuals
residuals2 <- resid(linearMod3)
summary(residuals2)
# The Kolmogorov-Smirnov sketch from Part 2 could be rerun here on linearMod3
```
We see that the mean of the residuals is again 0, as must hold for any OLS fit with an intercept.
```{r residplot2}
plot(fitted(linearMod3), resid(linearMod3), xlab = "Fitted values", ylab = "Residuals")
```
There is again a trend in the residual plot, one that looks very similar to our original model's, suggesting that the linear specification is still missing structure. The residuals are also still heteroscedastic.
```{r plotting2}
plot(linearMod3)
```
We see that the QQ-plot (plot 2) still shows non-normal, skewed residuals.
****
Since the predictor variables are still mostly categorical, including several dichotomous dummies, the ANOVA table is again a more natural way to read the model than the raw coefficient table.
```{r anova2}
anova(linearMod3)
```
As before, ANOVA cannot speak to the value of the model as a whole; it only assesses individual terms. Here everything except Friday, rural, new vehicle, owned house, and single home is significant. It again seems strange that rural does not show significance alongside the other dummy variables for its category, or that Friday is the only day that doesn't; as noted in Part 2, anova()'s sequential sums of squares make these results order-dependent.
```{r}
summary(linearMod3)
```
There are three components to pay attention to here: the F-test, the R-squared values and the individual coefficient p-values.
In this model, the F test statistic is quite high - 1063 - and highly significant, with an extremely low p-value. This tells us that our model is a significantly better fit to the data than a model without any predictor variables.
The share of variability in the dependent variable (number of trips taken per household) explained by the chosen x variables is much better than in the original model (19%), but still not especially high: about 38.5% of the variance is accounted for.
Of that explained variability, HHSIZ, HHEMP, HHSTU, HHLIC, the day of week (Monday and Friday in particular), total distance, high income, HHVEH, HHBIC, new vehicle, CARBUY, and MilesPr are most significant.
```{r}
possible_regressions(linearMod3)
```
Both the scatterplot with the fitted regression line and the reduction-in-SSE plot indicate that the fitted line is unlikely to have occurred by chance; it is very unlikely that no relationship exists between the variables.
```{r confint2}
confint(linearMod3, level = 0.95)
```
Zero falls within the confidence intervals for income, Tuesday, Wednesday, Thursday, all of the household-location variables, and single home; by this diagnostic, every other variable is significant. Most of these results agree with the other summaries and tests, but it is particularly noticeable that the household-location variables don't appear significant at all here, which is quite different from the first model.
```{r durbin-watson2}
dwtest(linearMod3)
```
The data is serially autocorrelated, i.e., there is a time-ordering effect in the residuals; the p-value for the test is 0.003.
I had initially noticed the NAs/singularity issue only after the Durbin-Watson test errored (element 29 being 0 made the matrix inverse impossible to compute), so the test was useful for that reason as well.
```{r heteroskedasticity2}
bptest(linearMod3)
coeftest(linearMod3,vcov=hccm(linearMod3))
```
White's test (the studentized Breusch-Pagan test) is most useful when the heteroskedasticity is non-linear and the error terms are not normally distributed. Heteroskedasticity is again present (high BP statistic; low p-value), so we should use a robust covariance matrix, as in Part 2.
```{r collinear2}
VIF(linearMod3)
#generalization_error()
```
All of the household-location dummy variables (center, suburb, exurb, rural) show high VIFs, indicating a substantial degree of collinearity among those explanatory variables. This is unsurprising: the four location dummies are mutually exclusive and nearly exhaustive, so each is close to a linear function of the others. Since the regression summary above didn't show those coefficients as particularly significant anyway, we can remove these variables as well and re-create the model.
```{r}
linearMod3 <- lm(HTRIPS ~ . - TrpPrs - SAMPN - Sat - Sun - other - center - suburb - exurb - rural, data = SmallHHfile)
summary(linearMod3)
```
A slight decrease in the R-squared! :(
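As a quick sanity check (my addition, not part of the original output), the VIFs can be recomputed on the reduced model to confirm the collinearity has been resolved:
```{r vif_recheck, eval=FALSE}
# With the location dummies removed, the remaining VIFs should be modest.
VIF(linearMod3)
```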
```{r aic2}
#ALL <- regsubsets(HTRIPS~.^2,data=SmallHHfile,method="exhaustive",nvmax=15,nbest=4, really.big = T)
#see_models(ALL,aicc=TRUE,report=5)
extractAIC(linearMod3)
#LMstep <- step(linearMod3,direction="both",trace=TRUE)
#summary(LMstep)
```
The AIC is again quite high; since a lower AIC indicates a better model, this suggests room for improvement. Unfortunately, the functions to compare candidate models and choose the one with the lowest AIC aren't executing properly, so I can't establish whether there are better models to be made (without a lot of manual labor). (The AIC was quite high even before removing the singular variables.)
***********
## ANALYSIS
_**2.3 Write a short summary of the model in a similar fashion as our discussion in class, highlighting which coefficients are significantly different than zero and what they tell us.**_
For the first model, interestingly, the number of miles traveled (per day, per person in the household) is not correlated with the weekdays Monday through Thursday, but it is correlated with Friday and Saturday: on a Friday, about 3.79 more miles are traveled per person. One can speculate that people are more likely to travel at the end of the workweek and early weekend, perhaps for chores such as grocery shopping or for short weekend trips. The number of cars held by the household and the number of persons in the household are both significant as well: for each additional car, about 4.89 more miles are traveled by each person in the household. The location of the home is also quite significant, whether suburban, exurban, or rural; we did expect location to greatly affect miles traveled, and the car and household-size counts were also expected to be significant. A couple of variables we thought might be significant were not, but overall this model held few surprises.
_**3.2 Write a comparison summary between Model 1 and Model 2.**_
For the second model, which estimates HTRIPS (the number of trips taken per household), different days became significant: this time Monday, in addition to the earlier-significant Friday. Trips per person and trips per household would covary, so I removed TrpPrs from the model. Here, the total distance traveled by the household, whether the household has a high income, HHEMP, HHSTU, HHLIC, DOW, the numbers of cars and bikes owned, whether new cars were bought, CARBUY (car bought in the last five years), and MilesPr are all highly statistically significant. These are all variables I expected to be significant; however, I was mistaken about the locality of the households mattering. So there is a good deal of overlap between the models. Some of the coefficients are surprising; for instance, more vehicles correlate with fewer trips, and more licensed people correlate with fewer trips.
I would speculate that people drive greater distances over the weekend (perhaps for weekend trips and such), whereas people take more trips early in the week (for instance, for work).
It now makes sense that new vehicle, owned house, and single home don't make a difference to the dependent variable. Intuitively they would all seem to be indicators of wealth and therefore mobility, but given the variation in economic circumstances, it is entirely conceivable that a household would rent a home or have limited economic capacity and still conduct regular trips, e.g., to the grocery store, as per normal.
A few other surprises: I expected the adjusted R-squared to drop in the second model due to the addition of so many variables, but that did not occur; the inclusion of important significant x variables seems to have more than overcome the penalty. I am also surprised by how much the R-squared increased in the second model (it roughly doubled). The R-squared could be improved further, I think, by adding or subtracting independent variables one at a time to minimize AIC, but that is a process I have forgotten (the step() sketch above automates it).