Skip to content

how to detect spurious regression using time-series cross-validation

Notifications You must be signed in to change notification settings

crossxwill/spurious_regression

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 

Repository files navigation

Summary

Using the forecast package, I propose an approach to detect spurious regressions using time-series cross-validation. The method is as follows:

  1. For each rolling subset of time-series data (i.e., rolling training data), fit a linear regression model and a competing naive plus drift model (i.e., random walk with drift). A naive plus drift model uses only the last data point of the training data plus a drift. To reduce typing, "random walk with drift model" and "naive plus drift model" will be referred to as "naive model".

  2. Measure the MSE of each model (i.e, regression and naive) using a rolling test data set that is h=1 period ahead of the training data.

  3. If the CV MSE of linear regression exceeds that of the naive model, then we can conclude that the regression is spurious. In other words, a regression is spurious if it fails to defeat a naive model.

It is recommended to use h=1. A large h may favor the regression model if both the response and the predictors share a strong deterministic trend.

Helper Functions

The following function wraps lm into a function that can be passed into tsCV. The function was copied from Rob Hyndman on Stack Overflow.

linRegCV <- function(y, h, xreg){
  if(NROW(xreg) < length(y) + h)
    stop("Not enough xreg data for forecasting")
  
  X <- xreg[seq_along(y),]
  fit <- tslm(y ~ X)
  X <- xreg[length(y)+seq(h),]
  fc <- forecast(fit, newdata=X, h=h)
  
  return(fc)
}

The following function runs tsCV on both a naive model and a linear regression model. Then the function computes the CV mean-squared error (CVMSE) of each model. It is recommended to use h=1.

For long time series, increase the initial parameter to a large integer. A small number of initial would increase computation time.

detectSpuriousRegCV <- function(y, h=1, xreg, initial=20){

  if(length(y) != NROW(xreg)) {
    warning('y and X are of different lengths. Data will be truncated. \n')
      
    minLength <- min(NROW(y), NROW(xreg))
    
    y <- head(y, minLength)
    xreg <- head(xreg, minLength)
    }
  
  linRegCVResiduals<- tsCV(y=y, linRegCV, h=h, xreg=xreg, initial=initial)
  naiveCVResiduals <- tsCV(y=y, rwf, h=h, initial=initial, drift=TRUE)
  
  naiveCVResiduals[is.na(linRegCVResiduals)] <- NA # leads and lags in the regression may introduce NA
  linRegCVResiduals[is.na(naiveCVResiduals)] <- NA # differencing may introduce NA

  
  MSE_lm <- mean(linRegCVResiduals^2, na.rm=TRUE)
  MSE_naive <- mean(naiveCVResiduals^2, na.rm=TRUE)
      
  out <- list(CV_MSE_lm=MSE_lm, CV_MSE_naive=MSE_naive, SpuriousRegression=MSE_lm>MSE_naive)
  
  return(out)
}

An example of a spurious regression

Is the number of Air Transport Passengers in Australia related to rice production in the country of Guinea?

data(ausair)
data(guinearice)

detectSpuriousRegCV(y=ausair, h=1, xreg=guinearice)
## Warning in detectSpuriousRegCV(y = ausair, h = 1, xreg = guinearice): y and X are of different lengths. Data will be truncated.
## $CV_MSE_lm
## [1] 21.53863
## 
## $CV_MSE_naive
## [1] 7.602087
## 
## $SpuriousRegression
## [1] TRUE

An example of a non-spurious regression

Is consumption related to income?

data(uschange)

consumption <- uschange[,'Consumption']
income <- uschange[,'Income']

detectSpuriousRegCV(y=consumption, h=1, xreg=income)
## $CV_MSE_lm
## [1] 0.3625108
## 
## $CV_MSE_naive
## [1] 0.4888498
## 
## $SpuriousRegression
## [1] FALSE

Simulating spurious regressions

Simulate 100 pairs of non-stationary time series with drift and trend.

simulateSpuriousData <- function(seed, ndrift=0.01, ntrend=0.02){
  nobs <- 60
  set.seed(seed)
  
  Xts <- ts(cumsum(ndrift + ntrend*seq_len(nobs) + rnorm(nobs))) # Random-walk with drift and trend
  Yts <- ts(cumsum(ndrift + ntrend*seq_len(nobs) + rnorm(nobs))) # Random-walk with drift and trend
  
  return(list(Xts=Xts, Yts=Yts))
}

manySpuriousData <- map(1:100, simulateSpuriousData)

For each pair, run detectSpuriousRegCV with h=3.

simulateSpuriousRegression <- map(manySpuriousData, function(lstData){
  out <- detectSpuriousRegCV(y=lstData$Yts, xreg=lstData$Xts, h=3)
  
  return(out$SpuriousRegression)
})

mean(unlist(simulateSpuriousRegression))
## [1] 0.99

Based on the simulation, detectSpuriousRegCV correctly found the spurious regression 99% of the time with h=3.

Since there is a strong deterministic trend in both pairs of time series, the h=3 parameter is poorly chosen. Use h=1 instead.

simulateSpuriousRegressionh1 <- map(manySpuriousData, function(lstData){
  out <- detectSpuriousRegCV(y=lstData$Yts, xreg=lstData$Xts, h=1)
  
  return(out$SpuriousRegression)
})

mean(unlist(simulateSpuriousRegressionh1))
## [1] 1

Based on the simulation, detectSpuriousRegCV correctly found the spurious regression 100% of the time with h=1.

Simulate another 100 pairs of non-stationary time series with very strong drift and trend.

manySpuriousDataStrongTrend <- map(1:100, simulateSpuriousData, ndrift=0.02, ntrend=0.04)

For each pair, run detectSpuriousRegCV with h=3.

simulateSpuriousRegressionStrongTrend_h3 <- map(manySpuriousDataStrongTrend, function(lstData){
  out <- detectSpuriousRegCV(y=lstData$Yts, xreg=lstData$Xts, h=3)
  
  return(out$SpuriousRegression)
})

mean(unlist(simulateSpuriousRegressionStrongTrend_h3))
## [1] 0.91

With h=3, strong deterministic trends reduce the effectiveness of detectSpuriousRegCV.

For each pair, run detectSpuriousRegCV with h=1.

simulateSpuriousRegressionStrongTrend_h1 <- map(manySpuriousDataStrongTrend, function(lstData){
  out <- detectSpuriousRegCV(y=lstData$Yts, xreg=lstData$Xts, h=1)
  
  return(out$SpuriousRegression)
})

mean(unlist(simulateSpuriousRegressionStrongTrend_h1))
## [1] 1

Based on the simulation, detectSpuriousRegCV correctly found the spurious regression 100% of the time with h=1. When detecting spurious regression with cross-validation, always set h=1.

Further work

  1. How does detectSpuriousRegCV perform on very long non-stationary time series (n > 1000)?

  2. What if $y$ and $x$ are independent random walks with no drift and no trend? Would detectSpuriousRegCV perform well?

  3. How does detectSpuriousRegCVperform on co-integrated time series? Stationary time series?

  4. How does detectSpuriousRegCV benchmark against popular co-integration tests like Engle-Granger and Phillips–Ouliaris?

  5. What if $y$ and $x$ both have integration order > 1? Would detectSpuriousRegCV perform well?

About

how to detect spurious regression using time-series cross-validation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages