Learning goals:
* fit regression models with a single numerical predictor and multiple
categorical predictors
* correctly interpret dummy variables and interaction terms in linear
regression models
* correctly interpret an ANOVA table in a model with correlated
predictors
Data files:
* house.csv: sales
prices of houses in Kansas City
Let's start by loading the mosaic library, reading in the house-price data set and summarizing the variables.
library(mosaic)
# Read in the data (this assumes house.csv is in your working directory)
house = read.csv('house.csv')
summary(house)
## home nbhd offers sqft brick
## Min. : 1.00 nbhd01:44 Min. :1.000 Min. :1450 No :86
## 1st Qu.: 32.75 nbhd02:45 1st Qu.:2.000 1st Qu.:1880 Yes:42
## Median : 64.50 nbhd03:39 Median :3.000 Median :2000
## Mean : 64.50 Mean :2.578 Mean :2001
## 3rd Qu.: 96.25 3rd Qu.:3.000 3rd Qu.:2140
## Max. :128.00 Max. :6.000 Max. :2590
## bedrooms bathrooms price
## Min. :2.000 Min. :2.000 Min. : 69100
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:111325
## Median :3.000 Median :2.000 Median :125950
## Mean :3.023 Mean :2.445 Mean :130427
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:148250
## Max. :5.000 Max. :4.000 Max. :211200
Although there are a lot of variables in this data set, we will focus on
four:
* price: the sales price of the house.
* sqft: the size of the house in square feet.
* nbhd: a categorical variable indicating which of three neighborhoods
the house is in.
* brick: a categorical variable indicating whether the house is made of
brick.
We'll begin by fitting a regression line for the price of the house in terms of its square footage:
plot(price~sqft, data=house, pch=19)
lm0 = lm(price~sqft, data=house)
abline(lm0)
coef(lm0)
## (Intercept) sqft
## -10091.12991 70.22632
According to the estimated slope of this model, each additional square foot costs roughly $70.
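To see what this slope means in dollar terms, we can compare the model's fitted prices for two hypothetical houses that differ by 100 square feet. (The sizes of 2000 and 2100 square feet below are just illustrative choices.)
# Fitted prices for two hypothetical houses differing by 100 square feet
predict(lm0, newdata = data.frame(sqft = c(2000, 2100)))
The second prediction should exceed the first by roughly 100 x $70, or about $7,000.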
However, the following two plots might give you cause for concern about this answer.
bwplot(price ~ nbhd, data=house)
bwplot(sqft ~ nbhd, data=house)
We see that both the prices and sizes of houses differ systematically across neighborhoods. Might the neighborhood be a confounding variable that distorts our estimate of the size-versus-price relationship? For example, some neighborhoods might be more desirable because of their location, not merely because of the size of the houses there.
Let's look at the neighborhoods individually to get a sense of whether this is plausible. First, neighborhood 1:
plot(price~sqft, data=subset(house, nbhd=='nbhd01'), pch=19)
lm1 = lm(price~sqft, data=subset(house, nbhd=='nbhd01'))
abline(lm1)
coef(lm1)
## (Intercept) sqft
## 32906.42259 40.30018
Within neighborhood 1 alone, it looks like each additional square foot costs about $40. How about neighborhood 2?
plot(price~sqft, data=subset(house, nbhd=='nbhd02'), pch=19)
lm2 = lm(price~sqft, data=subset(house, nbhd=='nbhd02'))
abline(lm2)
coef(lm2)
## (Intercept) sqft
## 25682.1102 49.4285
Here the size premium is about $50 per square foot. And neighborhood 3?
plot(price~sqft, data=subset(house, nbhd=='nbhd03'), pch=19)
lm3 = lm(price~sqft, data=subset(house, nbhd=='nbhd03'))
abline(lm3)
coef(lm3)
## (Intercept) sqft
## 56659.14762 49.32586
Also about $50 per square foot. So let's recap:
* In each individual neighborhood, the price of an additional square
foot is between 40 and 50 dollars.
* Yet for all three neighborhoods together, the price of an additional
square foot is 70 dollars.
This is a classic example of an aggregation paradox: that is, something which appears to hold for a group (all three neighborhoods together) simultaneously fails to hold for the individual members of that group. The following picture may give you some intuition for what's going on here. We will plot the points for the individual neighborhoods in different colors:
# Plot the whole data set
plot(price~sqft, data=house)
# Color the points and add the line for nbhd 1
points(price~sqft, data=subset(house, nbhd=='nbhd01'), pch=19, col='blue')
abline(lm1, col='blue')
# Color the points and add the line for nbhd 2
points(price~sqft, data=subset(house, nbhd=='nbhd02'), pch=19, col='red')
abline(lm2, col='red')
# Color the points and add the line for nbhd 3
points(price~sqft, data=subset(house, nbhd=='nbhd03'), pch=19, col='grey')
abline(lm3, col='grey')
# Finally, add the "global" line
abline(lm0, lwd=4)
You can see that the lines for the individual neighborhoods are all less steep than the overall line for the aggregated data set. This suggests that neighborhood is indeed a confounder for the price-versus-size relationship.
To resolve the aggregation paradox in the house-price data set, we applied a "split and fit" strategy:
1) Split the data into subsets, one for each group.
2) Fit a separate model to each subset.
With only a single grouping variable, the "split-and-fit" strategy often works just fine. But with multiple grouping variables, it gets cumbersome quickly. Therefore, we'll learn an alternative strategy that will prove to be much more useful than split-and-fit: dummy variables and interactions.
Remember that a dummy variable is a 0/1 indicator of membership in a particular group. Here's how we introduce dummy variables in a regression model.
lm4 = lm(price ~ sqft + nbhd, data=house)
coef(lm4)
## (Intercept) sqft nbhdnbhd02 nbhdnbhd03
## 21241.17443 46.38592 10568.69781 41535.30643
This output says that there are three different lines for the three
different neighborhoods:
* Neighborhood 1 (the baseline): price = 21241 + 46.39 * sqft
* Neighborhood 2: price = (21241 + 10569) + 46.39 * sqft
* Neighborhood 3: price = (21241 + 41535) + 46.39 * sqft
That is, three different lines with three different intercepts and the same slope (46.39). The coefficient labeled "(Intercept)" is the intercept for the baseline category (in this case, neighborhood 1). The coefficients on the nbhd02 and nbhd03 dummy variables are the offsets.
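One way to check this interpretation is to predict the price of a hypothetical 2000-square-foot house in each neighborhood; since the model forces a common slope, the predictions differ only by the dummy-variable offsets. (The 2000-square-foot figure is just an illustrative choice.)
# Predicted price of a 2000-square-foot house in each of the three neighborhoods
predict(lm4, newdata = data.frame(sqft = 2000, nbhd = c('nbhd01', 'nbhd02', 'nbhd03')))
The gaps between the three predictions should match the nbhdnbhd02 and nbhdnbhd03 coefficients above.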
If we believe that the price-versus-size relationship is different for each neighborhood, we may want to introduce an interaction term:
lm5 = lm(price ~ sqft + nbhd + nbhd:sqft, data=house)
coef(lm5)
## (Intercept) sqft nbhdnbhd02 nbhdnbhd03
## 32906.422594 40.300183 -7224.312425 23752.725029
## sqft:nbhdnbhd02 sqft:nbhdnbhd03
## 9.128318 9.025674
Now we're allowing both the slope and intercept to differ from
neighborhood to neighborhood. The rules are:
* The coefficients on the dummy variables get added to the baseline
intercept to form each neighborhood-specific intercept.
* The coefficients on the interaction terms get added to the baseline
slope to form each neighborhood-specific slope.
Thus the model output above says that:
* Neighborhood 1 (the baseline): price = 32906 + 40.30 * sqft
* Neighborhood 2: price = (32906 - 7224) + (40.30 + 9.13) * sqft
* Neighborhood 3: price = (32906 + 23753) + (40.30 + 9.03) * sqft
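In fact, with a full set of dummy variables and interactions, this model reproduces the three separate "split and fit" lines exactly. A quick check against the models fit earlier:
# The baseline intercept and slope should match the neighborhood-1 fit
coef(lm1)
# Adding the nbhd02 offset and interaction should recover the neighborhood-2 fit
c(coef(lm5)['(Intercept)'] + coef(lm5)['nbhdnbhd02'],
  coef(lm5)['sqft'] + coef(lm5)['sqft:nbhdnbhd02'])
coef(lm2)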
Once you've got the idea of dummy variables and interactions, you can add as many categorical variables as you deem appropriate. For example, consider the following model:
lm6 = lm(price ~ sqft + nbhd + brick + brick:sqft, data=house)
coef(lm6)
## (Intercept) sqft nbhdnbhd02 nbhdnbhd03 brickYes
## 28434.99926 41.27425 5447.01983 36547.03947 -16787.14206
## sqft:brickYes
## 17.85384
Here there are intercept offsets for neighborhoods 2 and 3 and for brick houses (brick = Yes), all measured relative to the baseline case of a non-brick house in neighborhood 1. The sqft:brickYes interaction additionally gives brick houses their own slope: each extra square foot in a brick house costs about $41.27 + $17.85, versus $41.27 in a non-brick house.
In the walkthrough on reaction time in video games, we learned that an analysis of variance can be used to partition variation in the outcome among the individual predictor variables in a regression model. An ANOVA table is constructed by adding variables to the model sequentially and tracking how much the predictive ability of the model improves at each stage. We measure the improvement by the change in predictable variation (PV), or equivalently by the reduction in the residual sum of squares (unpredictable variation, or UV).
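To make this concrete, here is a quick sketch of how the credit for one variable can be computed by hand: compare the residual sum of squares before and after that variable enters the model. (The two models below are refit here only for this comparison.)
# Unpredictable variation (residual sum of squares) with sqft alone...
lm_small = lm(price ~ sqft, data=house)
sum(resid(lm_small)^2)
# ...and after adding nbhd
lm_big = lm(price ~ sqft + nbhd, data=house)
sum(resid(lm_big)^2)
The drop between these two numbers is the credit that an ANOVA table assigns to nbhd when it is added to the model after sqft.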
Let's first load in the simple_anova command that we used on the video-games data. This is in the utilities file on my website:
# Load some useful utility functions
source('http://jgscott.github.io/teaching/r/utils/class_utils.R')
If we run an analysis of variance on our final model above, we get the following table.
lm6 = lm(price ~ sqft + nbhd + brick + brick:sqft, data=house)
simple_anova(lm6)
## Df R2 R2_improve sd sd_improve pval
## Intercept 1 0.00000 26869
## sqft 1 0.30579 0.30579 22476 4393.2 0.00000
## nbhd 2 0.68505 0.37926 15260 7215.4 0.00000
## brick 1 0.79026 0.10521 12504 2756.6 0.00000
## sqft:brick 1 0.79420 0.00393 12436 67.1 0.12945
## Residuals 122
It looks as though the neighborhood leads to the largest improvement in predictive power (sd_improve = 7215), followed by sqft (4393) and then brick (2756).
But what if we arbitrarily change the order in which we add the variables?
lm6alt = lm(price ~ brick + nbhd + sqft + brick:sqft, data=house)
simple_anova(lm6alt)
## Df R2 R2_improve sd sd_improve pval
## Intercept 1 0.00000 26869
## brick 1 0.20504 0.20504 24051 2817.6 0.00000
## nbhd 2 0.67161 0.46656 15582 8468.7 0.00000
## sqft 1 0.79026 0.11866 12504 3078.9 0.00000
## brick:sqft 1 0.79420 0.00393 12436 67.1 0.12945
## Residuals 122
Now the importance of brick and neighborhood looks much larger, and the importance of size a bit smaller. But the coefficients in the two models are exactly the same:
coef(lm6)
## (Intercept) sqft nbhdnbhd02 nbhdnbhd03 brickYes
## 28434.99926 41.27425 5447.01983 36547.03947 -16787.14206
## sqft:brickYes
## 17.85384
coef(lm6alt)
## (Intercept) brickYes nbhdnbhd02 nbhdnbhd03 sqft
## 28434.99926 -16787.14206 5447.01983 36547.03947 41.27425
## brickYes:sqft
## 17.85384
When the predictor variables are correlated with each other, an analysis of variance for a regression model---but not the model itself---will depend upon the order in which those variables are added. This is because the first predictor you add greedily takes credit for all the information it shares in common with any predictors added to the model later.
The moral of the story is that there is no such thing as "the" ANOVA table for a regression model with correlated predictors. There are multiple ANOVA tables, one for each possible ordering of the variables, and thus no unambiguous way to assign credit to the individual variables in the model.
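You don't need the simple_anova utility to see this phenomenon; base R's anova function, which also adds terms sequentially, shows the same order dependence in its sums of squares.
# Sequential (Type I) sums of squares depend on the order of the terms,
# even though lm6 and lm6alt have identical coefficients
anova(lm6)
anova(lm6alt)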
You can use a lattice plot to reproduce my "split and fit" strategy from above: that is, split the data into subsets and fit a line to each one. Here's one way that involves defining a custom "panel function" that is used by xyplot.
# Define a custom plotting function to be applied to each panel
plot_with_lines = function(x, y) {
panel.xyplot(x, y)
model_for_panel = lm(y~x)
panel.abline(model_for_panel)
}
# Pass this custom plotting function to xyplot
xyplot(price ~ sqft | nbhd, data=house, panel=plot_with_lines)
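If you'd rather not write a panel function, lattice can also draw the points and a per-panel least-squares line for you via the type argument:
# 'p' draws the points; 'r' adds a regression line within each panel
xyplot(price ~ sqft | nbhd, data=house, type=c('p', 'r'))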