-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathProject2_tcm.Rmd
568 lines (364 loc) · 16.4 KB
/
Project2_tcm.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
---
title: "Project 2"
author: "Daniel, Tom, and Rachael"
date: "Date"
output:
pdf_document: default
html_document: default
---
```{r setup, include=FALSE}
require("knitr")
#sourcedir <- "D:/GoogleDrive/Julie_SYS4021/Projects/TimeSeries2020"
#datadir <- "D:/GoogleDrive/Julie_SYS4021/Data Sets/AirQualityUCI"
datadir <- "C:/Users/tcmuh/Desktop/Fall2020/SYS6021/Data/AirQualityUCI"
sourcedir <-"C:/Users/tcmuh/Desktop/Fall2020/SYS6021/Source"
opts_knit$set(root.dir = sourcedir)
library(forecast)
library(imputeTS)
library(mtsdi)
library(ggplot2)
library(lubridate)
library(tidyverse)
library(ggfortify)
library(ggpubr)
library(tseries)
```
# Load data and impute missing values
```{r cars, warning=FALSE}
setwd(datadir)
airquality = read.csv('AirQualityUCI.csv')
# replace -200 with NA
airquality[airquality == -200] <- NA
# convert integer type to numeric
intcols = c(4,5,7,8,9,10,11,12)
for(i in 1:length(intcols)){
airquality[,intcols[i]] <- as.numeric(airquality[,intcols[i]])
}
setwd(sourcedir)
# create new data frame with just NO2 and impute missing values
AQdata = airquality["NO2.GT."]
AQdata = na_interpolation(AQdata)
# aggregate to daily maxima for model building
dailyAQ <- aggregate(AQdata, by=list(as.Date(airquality[,1],"%m/%d/%Y")), FUN=max)
# create time series of NO2
NO2.ts <- ts(dailyAQ[,2])
```
First need to visualize the data:
```{r}
AQdata.ts <- ts(AQdata)
autoplot(AQdata.ts)
dailyAQ.train <- dailyAQ[1:(dim(dailyAQ)[1]-7),]
dailyAQ.ts <- ts(dailyAQ.train$NO2.GT.)
autoplot(dailyAQ.ts)
```
```{r}
spec.pgram(dailyAQ.ts,spans=9,demean=T,log='no')
```
From PG highest peak at 192 days, next one may be about 7 days.
Also looks like it may have a constant mean for the first 180-200 days and then an increasing trend.
Next step is to look at the 7-day possible trend.
Also decided to try to look at the 192-day possible trend.
Finally, also decided to put the 7 day and 192 day together on one plot.
```{r}
time.NO2 <- c(1:(length(dailyAQ.ts)))
NO2.ts.lm1 <- lm(dailyAQ.train$NO2.GT.~time.NO2+sin(2*pi*time.NO2/7)+cos(2*pi*time.NO2/7))
NO2.ts.lm2 <- lm(dailyAQ.train$NO2.GT.~time.NO2+sin(2*pi*time.NO2/192)+cos(2*pi*time.NO2/192))
NO2.ts.lm3 <- lm(dailyAQ.train$NO2.GT.~time.NO2+sin(2*pi*time.NO2/7)+cos(2*pi*time.NO2/7)+sin(2*pi*time.NO2/192)+cos(2*pi*time.NO2/192))
summary(NO2.ts.lm1)
summary(NO2.ts.lm2)
summary(NO2.ts.lm3)
ggplot(dailyAQ.train[1:(length(dailyAQ.ts)),], aes(x=time.NO2,y=NO2.GT.)) + geom_line() + geom_line(aes(x=time.NO2,y=NO2.ts.lm1$fitted.values),color="red") + xlab("") + ylab("NO2")
ggplot(dailyAQ.train[1:(length(dailyAQ.ts)),], aes(x=time.NO2,y=NO2.GT.)) + geom_line() + geom_line(aes(x=time.NO2,y=NO2.ts.lm2$fitted.values),color="red") + xlab("") + ylab("NO2")
ggplot(dailyAQ.train[1:(length(dailyAQ.ts)),], aes(x=time.NO2,y=NO2.GT.)) + geom_line() + geom_line(aes(x=time.NO2,y=NO2.ts.lm3$fitted.values),color="red") + xlab("") + ylab("NO2")
```
It looks like both of the seasonal trends may be descriptive of the time series data
since at least one of the coefficients of the sin or cos term is valid at the
0.05 level.
Next need to look at some diagnostics for these three models.
```{r}
AIC(NO2.ts.lm1)
AIC(NO2.ts.lm2)
AIC(NO2.ts.lm3)
```
As an initial look, the AIC for the time series including both the 7-day and
192-day seasons is the best so we should continue with this model until we
find something that works better.
Maybe add a 10-day season as well?
```{r}
NO2.ts.lm4 <- lm(dailyAQ.train$NO2.GT.~time.NO2+sin(2*pi*time.NO2/10)+cos(2*pi*time.NO2/10))
summary(NO2.ts.lm4)
ggplot(dailyAQ.train[1:(length(dailyAQ.ts)),], aes(x=time.NO2,y=NO2.GT.)) + geom_line() + geom_line(aes(x=time.NO2,y=NO2.ts.lm4$fitted.values),color="red") + xlab("") + ylab("NO2")
```
10 days was not useful in describing the trend.
Not sure whether this is a normal technique or not but what about looking at a
lot of different values for this? Prime numbers are probably the best to look at
first.
```{r}
i <- 364
1/i
NO2.ts.lm5 <- lm(dailyAQ.train$NO2.GT.~time.NO2+sin(2*pi*time.NO2/i)+cos(2*pi*time.NO2/i))
summary(NO2.ts.lm5)
AIC(NO2.ts.lm5)
ggplot(dailyAQ.train[1:(length(dailyAQ.ts)),], aes(x=time.NO2,y=NO2.GT.)) + geom_line() + geom_line(aes(x=time.NO2,y=NO2.ts.lm5$fitted.values),color="red") + xlab("") + ylab("NO2%%")
```
28 (AIC 3982.339) and 64 (AIC 3970.735) both look interesting
128: AIC 3988.009
77: AIC 3980.565
64*7=448: AIC 3959.162
364: AIC 3956.527
365: AIC 3956.562
Looks like 365 is a good choice. Next we'll build a model with the three timeframes
7, 192, and 365.
```{r}
NO2.ts.lm6 <- lm(dailyAQ.train$NO2.GT.~time.NO2+sin(2*pi*time.NO2/7)+cos(2*pi*time.NO2/7)+sin(2*pi*time.NO2/192)+cos(2*pi*time.NO2/192)+sin(2*pi*time.NO2/365)+cos(2*pi*time.NO2/365))
summary(NO2.ts.lm6)
AIC(NO2.ts.lm6)
ggplot(dailyAQ.train[1:(length(dailyAQ.ts)),], aes(x=time.NO2,y=NO2.GT.)) + geom_line() + geom_line(aes(x=time.NO2,y=NO2.ts.lm6$fitted.values),color="red") + xlab("") + ylab("NO2")
```
The AIC dropped significantly for this model. Maybe try it with just 7-day and
365-day seasons to see if that is just as good.
```{r}
NO2.ts.lm7 <- lm(dailyAQ.train$NO2.GT.~time.NO2+sin(2*pi*time.NO2/7)+cos(2*pi*time.NO2/7)+sin(2*pi*time.NO2/365)+cos(2*pi*time.NO2/365))
summary(NO2.ts.lm7)
AIC(NO2.ts.lm7)
ggplot(dailyAQ.train[1:(length(dailyAQ.ts)),], aes(x=time.NO2,y=NO2.GT.)) + geom_line() + geom_line(aes(x=time.NO2,y=NO2.ts.lm7$fitted.values),color="red") + xlab("") + ylab("NO2")
```
AIC was better with all three seasonal terms (7-, 192-, and 365-day). We can explain
the 7- and 365-day intervals but the 192-day is a mystery (it is one half the length
of the data set of 384 observations). It is possible that this will lead to
overfitting the data and may perform worse on the test data.
Recommend we look at both the 7 and 365-day seasons for sure and compare the results
to the model using the 192-day season as well.
Lets take a look at the residuals now.
```{r}
plot(NO2.ts.lm6$residuals)
```
The residuals don't look too bad although they may be growing slightly in magnitude
from the start to the end. I think they appear to be zero-mean.
Next step is to model the residuals.
```{r}
AQresiduals <- NO2.ts.lm6$residuals
AQresiduals.ts <- ts(AQresiduals)
autoplot(AQresiduals.ts)
```
Is there a pattern to try to model with the residuals? I think so but let's look
at the ACF and PACF and see what might be in there.
```{r}
AQres.acf <- ggAcf(AQresiduals.ts)
AQres.pacf <- ggPacf(AQresiduals.ts)
ggarrange(AQres.acf,AQres.pacf,nrow=2,ncol=1)
```
From the ACF, the residuals may not be stationary. From the PACF, there are many
that are significant: at 1, 3, 7, 12, and 15 (not sure about 2).
The next step is to do an auto ARIMA model and see what comes out.
```{r}
AQres.auto <- auto.arima(AQresiduals.ts,approximation=FALSE)
summary(AQres.auto)
```
So no difference was needed, only a (3,0,2) model with 3 autocorrelation terms and
2 moving average terms.
The next step will be to see if I can improve on that with some different models
from the ACF and PACF graphs. Both the ACF and PACF look like they are somewhat
sinusoidal (PACF and ACF) and/or exponential (ACF) so an ARMA model is likely the
best choice.
I think I'll try a (2,1,2) model next and then a (3,1,2) model and see how they do
with AIC.
```{r}
AQres.212 <- arima(AQresiduals.ts, order=c(2,1,2), include.mean=FALSE)
summary(AQres.212)
AQres.312 <- arima(AQresiduals.ts, order=c(3,1,2), include.mean=FALSE)
summary(AQres.312)
AQres.303 <- arima(AQresiduals.ts, order=c(3,0,3), include.mean=FALSE)
summary(AQres.303)
AQres.313 <- arima(AQresiduals.ts, order=c(3,1,3), include.mean=FALSE)
summary(AQres.313)
AQres.304 <- arima(AQresiduals.ts, order=c(3,0,4), include.mean=FALSE)
summary(AQres.304)
```
After all those tests, the (3,0,3) model did best based on AIC:
(3,0,2) AIC: 3741.84
(3,1,3) AIC: 3738.84
(3,0,3) AIC: 3732.94
Next step is to look at the diagnostic plots.
```{r}
model1 = ggplot() + geom_point(aes(x=fitted(AQres.313), y=AQres.313$residuals)) + ggtitle("ARIMA (3,1,3)")
model2 = ggplot() + geom_point(aes(x=fitted(AQres.auto), y=AQres.auto$residuals)) + ggtitle("Auto (3,0,2)")
model3 = ggplot() + geom_point(aes(x=fitted(AQres.303), y=AQres.303$residuals)) + ggtitle("ARIMA (3,0,3)")
model4 = ggplot() + geom_point(aes(x=fitted(AQres.312), y=AQres.312$residuals)) + ggtitle("ARIMA (3,1,2)")
ggarrange(model1, model2, model3, model4, ncol=2, nrow=2)
```
No issues with the residuals vs. fitted.
```{r}
model1 = qplot(sample=AQres.313$residuals) + stat_qq_line(color="red") + ggtitle("ARIMA (3,1,3)")
model2 = qplot(sample=AQres.auto$residuals) + stat_qq_line(color="red") + ggtitle("Auto (3,0,2)")
model3 = qplot(sample=AQres.303$residuals) + stat_qq_line(color="red") + ggtitle("ARIMA (3,0,3)")
model4 = qplot(sample=AQres.312$residuals) + stat_qq_line(color="red") + ggtitle("ARIMA (3,1,2)")
ggarrange(model1, model2, model3, model4, ncol=2, nrow=2)
```
No problems with the Q-Q plots although the (3,0,3) looks the best as it doesn't
tail off as much at the upper end.
```{r}
# Diagnostics for independence
ggtsdiag(AQres.313,gof.lag=20)
ggtsdiag(AQres.auto,gof.lag=20)
ggtsdiag(AQres.303,gof.lag=20)
ggtsdiag(AQres.312,gof.lag=20)
```
The first three models look about the same with similar lags~=20. The last only
has about 11 lags of independence.
We'll continue with the best model so far: the ARIMA (3,0,3) model.
Look at the ACF and PACF of the residuals.
```{r}
AQres.303.acf <- ggAcf(AQres.303$residuals)
AQres.303.pacf <- ggPacf(AQres.303$residuals)
ggarrange(AQres.303.acf,AQres.303.pacf,nrow=2,ncol=1)
```
The spikes in the ACF and PACF at lag 11 are strange but the rest of the plots are
within the +/-0.1 bounds.
Next we need to look at the plot of the predicted vs. actual residuals.
```{r}
AQdata.fitted <- NO2.ts.lm6$fitted.values + fitted(AQres.auto)
ggplot() + geom_line(aes(x=time.NO2,y=NO2.ts[1:length(time.NO2)],color="True")) +
geom_line(aes(x=time.NO2,y=AQdata.fitted,color="Fitted")) + xlab("Time") +
ylab("Daily NO2 level")
```
The Fitted plot looks pretty good, not as extreme at some points but generally
trending close to the True value.
The next step is to use the model to predict the last 7 days that we didn't use
for training the model.
```{r}
AQdata.best.forecast <- forecast(AQres.303, h=7)
plot(AQdata.best.forecast)
next.7day.time <- c(385:391)
dailyAQ.all.ts <- ts(dailyAQ)
# The test data frame
next.7day <- data.frame(time.NO2 = next.7day.time, NO2.GT. = NO2.ts[next.7day.time])
# The actual time series for the test period
next.7day.ts <- dailyAQ.ts[next.7day.time]
#next.7day.ts <- ts(next.7day$NO2.GT.)
# Prediction for the next 7 days by ARIMA(3,0,3) model:
E_Y.pred <- predict(NO2.ts.lm6, newdata=next.7day)
e_t.pred <- forecast(AQres.303, h=7)
next.7day.prediction <- E_Y.pred + e_t.pred$mean
# MSE:
mean((next.7day.prediction-next.7day$NO2.GT.)^2)
time.predictions <- dailyAQ$Group.1[(length(time.NO2)+1):(length(time.NO2)+7)]
model1.predictions <- ggplot() +
geom_line(aes(x=time.predictions,y=next.7day$NO2.GT.),color="black") +
geom_line(aes(x=time.predictions,y=next.7day.prediction),color="red") +
geom_line(aes(x=time.predictions,y=E_Y.pred + e_t.pred$lower[,2]),
color="red",linetype="dashed") +
geom_line(aes(x=time.predictions,y=E_Y.pred + e_t.pred$upper[,2]),
color="red",linetype="dashed") +
xlab("") + ylab("NO2") +
ggtitle("NO2 Trend + Seasonal Model + ARIMA of Residuals")
model1.predictions
```
That looks pretty good. The entire True line falls inside the lower and upper bounds
of the prediction model.
I want to redo that part with the auto.arima model as well and see how it did.
```{r}
# Prediction for the next 7 days by ARIMA(3,0,2) model:
E_Y.pred <- predict(NO2.ts.lm6, newdata=next.7day)
e_t.pred <- forecast(AQres.auto, h=7)
next.7day.prediction <- E_Y.pred + e_t.pred$mean
# MSE:
mean((next.7day.prediction-next.7day$NO2.GT.)^2)
time.predictions <- dailyAQ$Group.1[(length(time.NO2)+1):(length(time.NO2)+7)]
model.auto.prediction <- ggplot() +
geom_line(aes(x=time.predictions,y=next.7day$NO2.GT.),color="black") +
geom_line(aes(x=time.predictions,y=next.7day.prediction),color="red") +
geom_line(aes(x=time.predictions,y=E_Y.pred + e_t.pred$lower[,2]),
color="red",linetype="dashed") +
geom_line(aes(x=time.predictions,y=E_Y.pred + e_t.pred$upper[,2]),
color="red",linetype="dashed") +
xlab("") + ylab("NO2") +
ggtitle("NO2 Trend + Seasonal Model + ARIMA of Residuals")
model.auto.prediction
```
MSE is better on the ARIMA(3,0,3) model so we'll stick with that model as we go
to the next section.
End of part 1
Part 2
Simulate a year's worth of NO2 data.
```{r}
# Simulate 365 days of NO2 with the best model
set.seed(1)
auto.sim <- arima.sim(n=365, list(ar=c(AQres.303$coef[1],AQres.303$coef[2],AQres.303$coef[3]),
ma=c(AQres.303$coef[4],AQres.303$coef[5],AQres.303$coef[6])),
sd=sqrt(AQres.303$sigma2))
# Add mean predictions and plot simulation of NO2 level
next.365day.time <- c(1:365)
next.365day <- data.frame(time.NO2 = next.365day.time)
next.365day.predictions <- predict(NO2.ts.lm6, newdata=next.365day)
# plot simulated NO2 level vs actual
b=autoplot(ts(next.365day.predictions + auto.sim),xlab="Time",ylab="Simulated NO2 Level")
a=autoplot(dailyAQ.ts, xlab="Time", ylab="Actual NO2 level")
ggarrange(a,b,nrow=2,ncol=1)
```
The two sets look similar but I'm not sure if this is for the next 365 days or for
just a random 365-day interval.
The next step is to look at the periodogram of the original data compared to the
periodogram of the simulated data.
```{r}
simAQ.ts <- next.365day.predictions + auto.sim
spec.pgram(dailyAQ.ts,spans=9,demean=T,log='no')
spec.pgram(simAQ.ts,spans=9,demean=T,log='no')
```
The two look similar except the peaks are slightly different magnitude in the
simulated data and there is a new peak in the 0.05-0.06 range.
Next step is to model the simulated data and see if the trend, seasonality, and
residuals all look similar.
```{r}
simAQ <- as.numeric(simAQ.ts)
NO2sim.ts.lm6 <- lm(simAQ~next.365day.time+sin(2*pi*next.365day.time/7)
+cos(2*pi*next.365day.time/7)+sin(2*pi*next.365day.time/192)
+cos(2*pi*next.365day.time/192)+sin(2*pi*next.365day.time/365)
+cos(2*pi*next.365day.time/365))
summary(NO2sim.ts.lm6)
AIC(NO2sim.ts.lm6)
NO2.ts.lm6$coefficients[2]
NO2sim.ts.lm6$coefficients[2]
abs(NO2.ts.lm6$coefficients[2]-NO2sim.ts.lm6$coefficients[2])/NO2.ts.lm6$coefficients[2]
ggplot(simAQ.ts, aes(x=next.365day.time,y=simAQ)) + geom_line() + geom_line(aes(x=next.365day.time,y=NO2sim.ts.lm6$fitted.values),color="red") + xlab("") + ylab("NO2")
```
The coefficients are in the same range although the coefficients on time in the
simulation is somewhat distant, 24.1% different (using (actual-sim)/actual) from
the coefficient on the actual time series. With the somewhat random error, this
is not unexpected.
Next is to find the percent difference in the mean and variance of the real time
series and the simuated one.
```{r}
mean(dailyAQ.ts)
var(dailyAQ.ts)
mean(simAQ.ts)
var(simAQ.ts)
abs(mean(dailyAQ.ts)-mean(simAQ.ts))/mean(dailyAQ.ts)
abs(var(dailyAQ.ts)-var(simAQ.ts))/var(dailyAQ.ts)
```
The mean and variance are quite a bit better than the time coefficients with the
percent difference between the means at .46% and the percent difference between
the variances at 6.1%.
Next we'll compare the ACF and PACF between the actual data and the simulated data.
```{r}
AQsim.residuals <- NO2sim.ts.lm6$residuals
AQsim.residuals.ts <- ts(AQsim.residuals)
AQsim.res.acf <- ggAcf(AQsim.residuals.ts)
AQsim.res.pacf <- ggPacf(AQsim.residuals.ts)
ggarrange(AQres.acf,AQsim.res.acf,AQres.pacf,AQsim.res.pacf,nrow=2,ncol=2)
```
The simulated ACF and PACF do not exactly the same as the actual ACF and PACF but
we should see what the auto.arima function says about the simulated residuals.
```{r}
AQsim.res.auto <- auto.arima(AQsim.residuals.ts,approximation=FALSE)
summary(AQres.auto)
summary(AQsim.res.auto)
AQsim.res.303 <- arima(AQsim.residuals.ts, order=c(3,0,3), include.mean=FALSE)
summary(AQres.303)
summary(AQsim.res.303)
```
What is interesting is that the auto.arima gave (3,0,2) for the actual data and
(1,0,0) for the simulated data but the AIC for the (3,0,3) model of the residuals
is actually better for both of them.
All in all, the simulation is a pretty good estimate of a series similar to the
actual data we originally modeled.