We can use the proportional reduction in prediction error from adding those covariates to measure how much they improve the regression’s predictive ability. This value, called the coefficient of determination or \(R^2\), is simply \[
R^2 = \frac{TSS - SSR}{TSS} = 1-\frac{SSR}{TSS},
\] which is the reduction in error moving from \(\overline{Y}\) to \(\X_i'\bhat\) as the predictor relative to the prediction error using \(\overline{Y}\). We can think of this as the fraction of the total prediction error eliminated by using \(\X_i\) to predict \(Y_i\). One thing to note is that OLS will always improve in-sample fit so that \(TSS \geq SSR\) even if \(\X_i\) is unrelated to \(Y_i\). This phantom improvement occurs because the whole point of OLS is to minimize the SSR, and it will do that even if it is just chasing noise.
Since regression always improves in-sample fit, \(R^2\) will fall between 0 and 1. A value of 0 would indicate estimated coefficients of exactly 0 on all covariates (except the intercept), so that \(Y_i\) and \(\X_i\) are perfectly orthogonal in the data (this is very unlikely to occur because there will almost always be some minimal but nonzero relationship by random chance). A value of 1 indicates a perfect linear fit.
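To make the definition concrete, here is a minimal numpy sketch on simulated data (the data and variable names are hypothetical, not from the text) that computes \(R^2\) from the total and residual sums of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept plus two covariates
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # OLS coefficients
resid = Y - X @ bhat

SSR = np.sum(resid ** 2)               # sum of squared residuals from the OLS fit
TSS = np.sum((Y - Y.mean()) ** 2)      # prediction error using the mean of Y alone
R2 = 1 - SSR / TSS                     # fraction of prediction error eliminated
print(R2)
```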
Then we can write the above system of equations as \[
\mb{Y} = \mathbb{X}\bfbeta + \mb{e},
\] where now \(\mathbb{X}\) is an \(n \times (k+1)\) matrix and \(\bfbeta\) is a column vector of length \(k+1\).
A critical link between the definition of OLS above and the matrix notation comes from representing sums in matrix form. In particular, we have \[
\begin{aligned}
\sum_{i=1}^n \X_i\X_i' &= \Xmat'\Xmat, \\
\sum_{i=1}^n \X_iY_i &= \Xmat'\mb{Y}.
\end{aligned}
\]
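These identities are easy to verify numerically. Below is a small numpy sketch on simulated data (the names and dimensions are arbitrary assumptions) checking that the sums of outer products and cross products match the matrix expressions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x (k+1) design matrix
Y = rng.normal(size=n)

# sum over i of the outer products X_i X_i' equals X'X
sum_outer = sum(np.outer(X[i], X[i]) for i in range(n))
print(np.allclose(sum_outer, X.T @ X))   # True

# sum over i of X_i Y_i equals X'Y
sum_xy = sum(X[i] * Y[i] for i in range(n))
print(np.allclose(sum_xy, X.T @ Y))      # True
```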
6.4 Rank, linear independence, and multicollinearity
When introducing the OLS estimator, we noted that it would exist when \(\sum_{i=1}^n \X_i\X_i'\) is positive definite or that there is “no multicollinearity.” This assumption is equivalent to saying that the matrix \(\mathbb{X}\) is full column rank, meaning that \(\text{rank}(\mathbb{X}) = (k+1)\), where \(k+1\) is the number of columns of \(\mathbb{X}\). Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and linear independence means that \(\mathbb{X}\mb{b} = 0\) if and only if \(\mb{b}\) is a column vector of 0s. In other words, we have \[
b_{1}\mathbb{X}_{1} + b_{2}\mathbb{X}_{2} + \cdots + b_{k+1}\mathbb{X}_{k+1} = 0 \quad\iff\quad b_{1} = b_{2} = \cdots = b_{k+1} = 0,
\] where \(\mathbb{X}_j\) is the \(j\)th column of \(\mathbb{X}\). Thus, full column rank says that all the columns are linearly independent or that there is no “multicollinearity.”
How could this be violated? Suppose we accidentally included a linear function of one variable so that \(\mathbb{X}_2 = 2\mathbb{X}_1\). Then we have \[
\begin{aligned}
b_{1}\mathbb{X}_{1} + b_{2}(2\mathbb{X}_{1}) + b_{3}\mathbb{X}_{3} + \cdots + b_{k+1}\mathbb{X}_{k+1} &= 0 \\
(b_{1} + 2b_{2})\mathbb{X}_{1} + b_{3}\mathbb{X}_{3} + \cdots + b_{k+1}\mathbb{X}_{k+1} &= 0,
\end{aligned}
\] which holds for \(b_3 = b_4 = \cdots = b_{k+1} = 0\) and \(b_1 = -2b_2\). Thus, the collection of columns is linearly dependent, so we know that the rank of \(\mathbb{X}\) must be less than full column rank (that is, less than \(k+1\)). Hopefully, it is also clear that if we removed the problematic column \(\mathbb{X}_2\), the resulting matrix would have \(k\) linearly independent columns, implying that \(\mathbb{X}\) is rank \(k\).
Why does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that \(\Xmat\) is of full column rank if and only if \(\Xmat'\Xmat\) is non-singular, and a matrix is invertible if and only if it is non-singular. Thus, the columns of \(\Xmat\) being linearly independent means that the inverse \((\Xmat'\Xmat)^{-1}\) exists and so does \(\bhat\). Further, this full rank condition also implies that \(\Xmat'\Xmat = \sum_{i=1}^{n}\X_{i}\X_{i}'\) is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals.
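As a quick numerical illustration (a hypothetical example, not from the text), the sketch below builds a design matrix containing a column that is exactly twice another column and confirms that neither \(\Xmat\) nor \(\Xmat'\Xmat\) has full rank:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, 2 * x1, x3])  # third column is 2 times the second

print(np.linalg.matrix_rank(X))        # 3, not the full column rank of 4
print(np.linalg.matrix_rank(X.T @ X))  # also 3, so X'X is singular and cannot be inverted
```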
What are common situations that lead to violations of no multicollinearity? We have seen one above, with one variable being a linear function of another. But this problem can come out in more subtle ways. Suppose that we have a set of dummy variables corresponding to a single categorical variable, like the region of the country. In the US, this might mean we have \(X_{i1} = 1\) for units in the West (0 otherwise), \(X_{i2} = 1\) for units in the Midwest (0 otherwise), \(X_{i3} = 1\) for units in the South (0 otherwise), and \(X_{i4} = 1\) for units in the Northeast (0 otherwise). Each unit has to be in one of these four regions, so there is a linear dependence between these variables, \[
X_{i4} = 1 - X_{i1} - X_{i2} - X_{i3}.
\] That is, if I know that you are not in the West, Midwest, or South regions, I know that you are in the Northeast. We would get a linear dependence if we tried to include all of these variables in our regression with an intercept. (Note the 1 in the relationship between \(X_{i4}\) and the other variables; that is why the linear dependence arises when we include a constant.) Thus, we usually omit one dummy variable from each categorical variable. In that case, the coefficients on the remaining dummies are differences in means between that category and the omitted one (possibly conditional on any other variables included in the regression). So if we omitted \(X_{i4}\), then the coefficient on \(X_{i1}\) would be the difference in mean outcomes between units in the West and Northeast regions.
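The following numpy sketch (simulated regions, hypothetical names) shows this dummy variable trap directly: including all four region dummies alongside an intercept produces a rank-deficient design matrix, while dropping one dummy restores full column rank:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
region = rng.integers(0, 4, size=n)                   # 0=West, 1=Midwest, 2=South, 3=Northeast
D = (region[:, None] == np.arange(4)).astype(float)   # all four region dummies

X_all = np.column_stack([np.ones(n), D])              # intercept plus all four dummies
X_drop = np.column_stack([np.ones(n), D[:, :3]])      # omit the Northeast dummy

print(np.linalg.matrix_rank(X_all))    # 4 < 5: the columns are linearly dependent
print(np.linalg.matrix_rank(X_drop))   # 4: full column rank after omitting one category
```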
Suppose that the only covariate is a single binary variable, so that \(\X_i = (1\; X_{i})'\), where \(X_i \in \{0,1\}\). In this case, the OLS coefficient on \(X_i\), \(\widehat{\beta}_{1}\), is exactly equal to the difference in sample means of \(Y_i\) in the \(X_i = 1\) group and the \(X_i = 0\) group: \[
\widehat{\beta}_{1} = \frac{\sum_{i=1}^{n} X_{i}Y_{i}}{\sum_{i=1}^{n} X_{i}} - \frac{\sum_{i=1}^{n} (1 - X_{i})Y_{i}}{\sum_{i=1}^{n} (1 - X_{i})} = \overline{Y}_{X =1} - \overline{Y}_{X=0}.
\] This result is not an approximation. It holds exactly for any sample size.
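A short numpy check on simulated data (hypothetical, just to illustrate the exactness of the result) confirms that the OLS slope on a binary covariate equals the difference in group means:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.integers(0, 2, size=n).astype(float)   # binary covariate
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)

diff_means = y[x == 1].mean() - y[x == 0].mean()
print(np.allclose(bhat[1], diff_means))        # True: the slope is exactly the difference in means
```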
We can generalize this idea to discrete variables more broadly. Suppose we have our region variables from the last section and include in our covariates a constant and the dummies for the West, Midwest, and South regions. Then the coefficient on the West dummy will be \[
\widehat{\beta}_{\text{west}} = \overline{Y}_{\text{west}} - \overline{Y}_{\text{northeast}},
\] which is exactly the difference in sample means of \(Y_i\) between the West region and units in the “omitted region,” the Northeast.
Note that these interpretations only hold when the regression consists solely of the binary variable or the set of categorical dummy variables. These exact relationships fail when other covariates are added to the model.
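A numpy sketch with simulated regions (hypothetical data; the coefficient ordering follows the column order used below) verifies that the West coefficient equals the difference between the West and Northeast sample means when the regression contains only the intercept and the three included dummies:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
region = rng.integers(0, 4, size=n)                   # 0=West, 1=Midwest, 2=South, 3=Northeast
D = (region[:, None] == np.arange(3)).astype(float)   # dummies for West, Midwest, South only
y = region + rng.normal(size=n)                       # any outcome works for this check

X = np.column_stack([np.ones(n), D])
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)

west_vs_northeast = y[region == 0].mean() - y[region == 3].mean()
print(np.allclose(bhat[1], west_vs_northeast))        # True
```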
6.6 Projection and geometry of least squares
OLS has a very nice geometric interpretation that can add a lot of intuition for various aspects of the method. In this geometric approach, we view \(\mb{Y}\) as an \(n\)-dimensional vector in \(\mathbb{R}^n\). As we saw above, OLS in matrix form is about finding a linear combination of the covariate matrix \(\Xmat\) closest to this vector in terms of the Euclidean distance (which is just the sum of squares).
Let \(\mathcal{C}(\Xmat) = \{\Xmat\mb{b} : \mb{b} \in \mathbb{R}^{k+1}\}\) be the column space of the matrix \(\Xmat\). This set is all linear combinations of the columns of \(\Xmat\) or the set of all possible linear predictions we could obtain from \(\Xmat\). Notice that the OLS fitted values, \(\Xmat\bhat\), are in this column space. If, as we assume, \(\Xmat\) has full column rank of \(k+1\), then the column space \(\mathcal{C}(\Xmat)\) will be a \(k+1\)-dimensional surface inside of the larger \(n\)-dimensional space. If \(\Xmat\) has two columns, the column space will be a plane.
Another interpretation of the OLS estimator is that it finds the linear predictor as the closest point in the column space of \(\Xmat\) to the outcome vector \(\mb{Y}\). This is called the projection of \(\mb{Y}\) onto \(\mathcal{C}(\Xmat)\). Figure 6.4 shows this projection for a case with \(n=3\) and 2 columns in \(\Xmat\). The shaded blue region represents the plane of the column space of \(\Xmat\), and we can see that \(\Xmat\bhat\) is the closest point to \(\mb{Y}\) in that space. That’s the whole idea of the OLS estimator: find the linear combination of the columns of \(\Xmat\) (a point in the column space) that minimizes the Euclidean distance between that point and the outcome vector (the sum of squared residuals).
This figure shows that the residual vector, which is the difference between the \(\mb{Y}\) vector and the projection \(\Xmat\bhat\), is perpendicular or orthogonal to the column space of \(\Xmat\). This orthogonality is a consequence of the residuals being orthogonal to all the columns of \(\Xmat\), \[
\Xmat'\widehat{\mb{e}} = 0,
\] as we established above. Being orthogonal to all the columns means it will also be orthogonal to all linear combinations of the columns.
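This orthogonality is easy to see numerically. The sketch below (simulated data, hypothetical names) computes OLS residuals and shows that they are orthogonal to each column of the design matrix and hence to any linear combination of those columns:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = rng.normal(size=n)

bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)
ehat = Y - X @ bhat                     # OLS residuals

print(X.T @ ehat)                       # numerically a zero vector: orthogonal to every column
c = rng.normal(size=k + 1)              # an arbitrary linear combination of the columns
print(np.dot(X @ c, ehat))              # also numerically zero
```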
6.7 Projection and annihilator matrices
Now that we have the idea of projection to the column space of \(\Xmat\), we can define a way to project any vector into that space. The \(n\times n\) projection matrix, \[
\mb{P}_{\Xmat} = \Xmat (\Xmat'\Xmat)^{-1} \Xmat',
\] projects a vector into \(\mathcal{C}(\Xmat)\). In particular, we can see that this gives us the fitted values for \(\mb{Y}\): \[
\mb{P}_{\Xmat}\mb{Y} = \Xmat (\Xmat'\Xmat)^{-1} \Xmat'\mb{Y} = \Xmat\bhat.
\] Because we sometimes write the linear predictor as \(\widehat{\mb{Y}} = \Xmat\bhat\), the projection matrix is also called the hat matrix. With either name, multiplying a vector by \(\mb{P}_{\Xmat}\) gives the best linear predictor of that vector as a function of \(\Xmat\). Intuitively, any vector that is already a linear combination of the columns of \(\Xmat\) (so is in \(\mathcal{C}(\Xmat)\)) should be unaffected by this projection: the closest point in \(\mathcal{C}(\Xmat)\) to a point already in \(\mathcal{C}(\Xmat)\) is itself. We can also see this algebraically for any linear combination \(\Xmat\mb{c}\), \[
\mb{P}_{\Xmat}\Xmat\mb{c} = \Xmat (\Xmat'\Xmat)^{-1} \Xmat'\Xmat\mb{c} = \Xmat\mb{c},
\] because \((\Xmat'\Xmat)^{-1} \Xmat'\Xmat\) simplifies to the identity matrix. In particular, the projection of \(\Xmat\) onto itself is just itself: \(\mb{P}_{\Xmat}\Xmat = \Xmat\).
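The sketch below (simulated data; the explicit matrix inverse is fine for a small illustration, though not how one would compute OLS in practice) constructs \(\mb{P}_{\Xmat}\) and checks both facts: it reproduces the fitted values and leaves \(\Xmat\) unchanged:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T          # n x n projection (hat) matrix
bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(P @ Y, X @ bhat))           # True: P_X Y equals the fitted values
print(np.allclose(P @ X, X))                  # True: projecting X onto itself returns X
```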
The second matrix related to projection is the annihilator matrix, \[
\mb{M}_{\Xmat} = \mb{I}_{n} - \mb{P}_{\Xmat},
\] which projects any vector into the orthogonal complement to the column space of \(\Xmat\), \[
\mathcal{C}^{\perp}(\Xmat) = \{\mb{c} \in \mathbb{R}^n\;:\; \Xmat'\mb{c} = 0 \}.
\] This matrix is called the annihilator matrix because if you apply it to any linear combination of \(\Xmat\), you get 0: \[
\mb{M}_{\Xmat}\Xmat\mb{c} = \Xmat\mb{c} - \mb{P}_{\Xmat}\Xmat\mb{c} = \Xmat\mb{c} - \Xmat\mb{c} = 0,
\] and in particular, \(\mb{M}_{\Xmat}\Xmat = 0\). Why should we care about this matrix? Perhaps a more evocative name might be the residual maker since it makes residuals when applied to \(\mb{Y}\), \[
\mb{M}_{\Xmat}\mb{Y} = (\mb{I}_{n} - \mb{P}_{\Xmat})\mb{Y} = \mb{Y} - \mb{P}_{\Xmat}\mb{Y} = \mb{Y} - \Xmat\bhat = \widehat{\mb{e}}.
\]
There are several fundamental properties of the projection and annihilator matrices that are useful:
- \(\mb{P}_{\Xmat}\) and \(\mb{M}_{\Xmat}\) are idempotent, which means that multiplying either matrix by itself returns the same matrix: \(\mb{P}_{\Xmat}\mb{P}_{\Xmat} = \mb{P}_{\Xmat}\) and \(\mb{M}_{\Xmat}\mb{M}_{\Xmat} = \mb{M}_{\Xmat}\).
- \(\mb{P}_{\Xmat}\) and \(\mb{M}_{\Xmat}\) are symmetric \(n \times n\) matrices, so that \(\mb{P}_{\Xmat}' = \mb{P}_{\Xmat}\) and \(\mb{M}_{\Xmat}' = \mb{M}_{\Xmat}\).
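A small numpy check (simulated data, hypothetical names) of the residual-maker interpretation and of these properties:

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix
M = np.eye(n) - P                      # annihilator / residual maker

bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(M @ Y, Y - X @ bhat))               # M_X Y are the OLS residuals
print(np.allclose(M @ X, 0))                          # M_X annihilates the columns of X
print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # both are idempotent
print(np.allclose(P, P.T), np.allclose(M, M.T))       # both are symmetric
```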
6.8 Residual regression
There are many situations where we can partition the covariates into two groups, and we might wonder whether it is possible to express or calculate the OLS coefficients for just one set of covariates. In particular, let the columns of \(\Xmat\) be partitioned into \([\Xmat_{1} \Xmat_{2}]\), so that the linear prediction we are estimating is \[
\mb{Y} = \Xmat_{1}\bfbeta_{1} + \Xmat_{2}\bfbeta_{2} + \mb{e},
\] with estimated coefficients and residuals \[
\mb{Y} = \Xmat_{1}\bhat_{1} + \Xmat_{2}\bhat_{2} + \widehat{\mb{e}}.
One way to obtain \(\bhat_{1}\) is the residual regression approach:
- Use OLS to regress \(\mb{Y}\) on \(\Xmat_2\) and obtain residuals \(\widetilde{\mb{e}}_2\).
- Use OLS to regress each column of \(\Xmat_1\) on \(\Xmat_2\) and obtain residuals \(\widetilde{\Xmat}_1\).
- Use OLS to regress \(\widetilde{\mb{e}}_{2}\) on \(\widetilde{\Xmat}_1\).
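The coefficients from this last step match the coefficients on \(\Xmat_1\) from the full regression. A minimal numpy sketch on simulated data (hypothetical dimensions and names) verifies this numerically:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 300
X1 = rng.normal(size=(n, 2))                                  # covariates of interest
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept plus other covariates
Y = X1 @ np.array([1.5, -2.0]) + X2 @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

# full regression on [X1 X2]
X = np.column_stack([X1, X2])
b_full, *_ = np.linalg.lstsq(X, Y, rcond=None)

# residual regression: residualize Y and each column of X1 on X2, then regress residuals on residuals
g_y, *_ = np.linalg.lstsq(X2, Y, rcond=None)
e_tilde = Y - X2 @ g_y                         # residuals from regressing Y on X2
G, *_ = np.linalg.lstsq(X2, X1, rcond=None)
X1_tilde = X1 - X2 @ G                         # residuals from regressing X1's columns on X2

b_resid, *_ = np.linalg.lstsq(X1_tilde, e_tilde, rcond=None)
print(np.allclose(b_resid, b_full[:2]))        # True: same coefficients on X1
```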