Using the forecast
package, I propose an approach to detect spurious regressions using time-series cross-validation. The method is as follows:
-
For each rolling subset of time-series data (i.e., rolling training data), fit a linear regression model and a competing naive plus drift model (i.e., random walk with drift). A naive plus drift model uses only the last data point of the training data plus a drift. To reduce typing, "random walk with drift model" and "naive plus drift model" will be referred to as "naive model".
-
Measure the MSE of each model (i.e, regression and naive) using a rolling test data set that is
h=1
period ahead of the training data. -
If the CV MSE of linear regression exceeds that of the naive model, then we can conclude that the regression is spurious. In other words, a regression is spurious if it fails to defeat a naive model.
It is recommended to use h=1
. A large h
may favor the regression model if both the response and the predictors share a strong deterministic trend.
The following function wraps lm
into a function that can be passed into tsCV
. The function was copied from Rob Hyndman on Stack Overflow.
linRegCV <- function(y, h, xreg){
if(NROW(xreg) < length(y) + h)
stop("Not enough xreg data for forecasting")
X <- xreg[seq_along(y),]
fit <- tslm(y ~ X)
X <- xreg[length(y)+seq(h),]
fc <- forecast(fit, newdata=X, h=h)
return(fc)
}
The following function runs tsCV
on both a naive model and a linear regression model. Then the function computes the CV mean-squared error (CVMSE) of each model. It is recommended to use h=1
.
For long time series, increase the initial
parameter to a large integer. A small number of initial
would increase computation time.
detectSpuriousRegCV <- function(y, h=1, xreg, initial=20){
if(length(y) != NROW(xreg)) {
warning('y and X are of different lengths. Data will be truncated. \n')
minLength <- min(NROW(y), NROW(xreg))
y <- head(y, minLength)
xreg <- head(xreg, minLength)
}
linRegCVResiduals<- tsCV(y=y, linRegCV, h=h, xreg=xreg, initial=initial)
naiveCVResiduals <- tsCV(y=y, rwf, h=h, initial=initial, drift=TRUE)
naiveCVResiduals[is.na(linRegCVResiduals)] <- NA # leads and lags in the regression may introduce NA
linRegCVResiduals[is.na(naiveCVResiduals)] <- NA # differencing may introduce NA
MSE_lm <- mean(linRegCVResiduals^2, na.rm=TRUE)
MSE_naive <- mean(naiveCVResiduals^2, na.rm=TRUE)
out <- list(CV_MSE_lm=MSE_lm, CV_MSE_naive=MSE_naive, SpuriousRegression=MSE_lm>MSE_naive)
return(out)
}
Is the number of Air Transport Passengers in Australia related to rice production in the country of Guinea?
data(ausair)
data(guinearice)
detectSpuriousRegCV(y=ausair, h=1, xreg=guinearice)
## Warning in detectSpuriousRegCV(y = ausair, h = 1, xreg = guinearice): y and X are of different lengths. Data will be truncated.
## $CV_MSE_lm
## [1] 21.53863
##
## $CV_MSE_naive
## [1] 7.602087
##
## $SpuriousRegression
## [1] TRUE
Is consumption related to income?
data(uschange)
consumption <- uschange[,'Consumption']
income <- uschange[,'Income']
detectSpuriousRegCV(y=consumption, h=1, xreg=income)
## $CV_MSE_lm
## [1] 0.3625108
##
## $CV_MSE_naive
## [1] 0.4888498
##
## $SpuriousRegression
## [1] FALSE
Simulate 100 pairs of non-stationary time series with drift and trend.
simulateSpuriousData <- function(seed, ndrift=0.01, ntrend=0.02){
nobs <- 60
set.seed(seed)
Xts <- ts(cumsum(ndrift + ntrend*seq_len(nobs) + rnorm(nobs))) # Random-walk with drift and trend
Yts <- ts(cumsum(ndrift + ntrend*seq_len(nobs) + rnorm(nobs))) # Random-walk with drift and trend
return(list(Xts=Xts, Yts=Yts))
}
manySpuriousData <- map(1:100, simulateSpuriousData)
For each pair, run detectSpuriousRegCV
with h=3
.
simulateSpuriousRegression <- map(manySpuriousData, function(lstData){
out <- detectSpuriousRegCV(y=lstData$Yts, xreg=lstData$Xts, h=3)
return(out$SpuriousRegression)
})
mean(unlist(simulateSpuriousRegression))
## [1] 0.99
Based on the simulation, detectSpuriousRegCV
correctly found the spurious regression 99% of the time with h=3
.
Since there is a strong deterministic trend in both pairs of time series, the h=3
parameter is poorly chosen. Use h=1
instead.
simulateSpuriousRegressionh1 <- map(manySpuriousData, function(lstData){
out <- detectSpuriousRegCV(y=lstData$Yts, xreg=lstData$Xts, h=1)
return(out$SpuriousRegression)
})
mean(unlist(simulateSpuriousRegressionh1))
## [1] 1
Based on the simulation, detectSpuriousRegCV
correctly found the spurious regression 100% of the time with h=1
.
Simulate another 100 pairs of non-stationary time series with very strong drift and trend.
manySpuriousDataStrongTrend <- map(1:100, simulateSpuriousData, ndrift=0.02, ntrend=0.04)
For each pair, run detectSpuriousRegCV
with h=3
.
simulateSpuriousRegressionStrongTrend_h3 <- map(manySpuriousDataStrongTrend, function(lstData){
out <- detectSpuriousRegCV(y=lstData$Yts, xreg=lstData$Xts, h=3)
return(out$SpuriousRegression)
})
mean(unlist(simulateSpuriousRegressionStrongTrend_h3))
## [1] 0.91
With h=3
, strong deterministic trends reduce the effectiveness of detectSpuriousRegCV
.
For each pair, run detectSpuriousRegCV
with h=1
.
simulateSpuriousRegressionStrongTrend_h1 <- map(manySpuriousDataStrongTrend, function(lstData){
out <- detectSpuriousRegCV(y=lstData$Yts, xreg=lstData$Xts, h=1)
return(out$SpuriousRegression)
})
mean(unlist(simulateSpuriousRegressionStrongTrend_h1))
## [1] 1
Based on the simulation, detectSpuriousRegCV
correctly found the spurious regression 100% of the time with h=1
. When detecting spurious regression with cross-validation, always set h=1
.
-
How does
detectSpuriousRegCV
perform on very long non-stationary time series (n > 1000)? -
What if
$y$ and$x$ are independent random walks with no drift and no trend? WoulddetectSpuriousRegCV
perform well? -
How does
detectSpuriousRegCV
perform on co-integrated time series? Stationary time series? -
How does
detectSpuriousRegCV
benchmark against popular co-integration tests like Engle-Granger and Phillips–Ouliaris? -
What if
$y$ and$x$ both have integration order > 1? WoulddetectSpuriousRegCV
perform well?