Commit

Merge pull request #2 from mattblackwell/main

merge

noahdasanaike authored Dec 4, 2023
2 parents fc6ead7 + 786c070 commit 6520d71
Showing 18 changed files with 30 additions and 30 deletions.
2 changes: 1 addition & 1 deletion 06_linear_model.qmd
@@ -418,7 +418,7 @@ Here, the coefficients are slightly more interpretable:

* $\beta_1$: the marginal effect of $X_{i1}$ on predicted $Y_i$ when $X_{i2} = 0$.
* $\beta_2$: the marginal effect of $X_{i2}$ on predicted $Y_i$ when $X_{i1} = 0$.
* $\beta_3$: the change in the marginal effect of $X_{i1}$ due to a one-unit change in $X_{i2}$ **OR** the change in the marginal effect of $X_{i2}$ due to a one-unit change in $X_{i2}$.
* $\beta_3$: the change in the marginal effect of $X_{i1}$ due to a one-unit change in $X_{i2}$ **OR** the change in the marginal effect of $X_{i2}$ due to a one-unit change in $X_{i1}$.

If we add more covariates to this BLP, these interpretations change to "holding all other covariates constant."
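
To make these interpretations concrete, here is a minimal sketch in R (the data and variable names are made up for illustration, not taken from the text): in an interacted regression, the estimated marginal effect of $X_{i1}$ shifts by the interaction coefficient for each one-unit change in $X_{i2}$.

```{r}
# Hypothetical data to illustrate the interaction interpretation
set.seed(1)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 1 * x2 + 0.5 * x1 * x2 + rnorm(n)

fit <- lm(y ~ x1 * x2)   # includes x1, x2, and x1:x2
b <- coef(fit)

b["x1"]                  # estimated marginal effect of x1 when x2 = 0 (beta_1)
b["x1"] + b["x1:x2"]     # estimated marginal effect of x1 when x2 = 1 (beta_1 + beta_3)
```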

12 changes: 6 additions & 6 deletions 07_least_squares.qmd
@@ -284,7 +284,7 @@ $$

## Rank, linear independence, and multicollinearity {#sec-rank}

When introducing the OLS estimator, we noted that it would exist when $\sum_{i=1}^n \X_i\X_i'$ is positive definite or that there is "no multicollinearity." This assumption is equivalent to saying that the matrix $\mathbb{X}$ is full column rank, meaning that $\text{rank}(\mathbb{X}) = (k+1)$, where $k+1$ is the number of columns of $\mathbb{X}$. Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and **linear independence** means that if $\mathbb{X}\mb{b} = 0$ if and only if $\mb{b}$ is a column vector of 0s. In other words, we have
When introducing the OLS estimator, we noted that it would exist when $\sum_{i=1}^n \X_i\X_i'$ is positive definite or that there is "no multicollinearity." This assumption is equivalent to saying that the matrix $\mathbb{X}$ is full column rank, meaning that $\text{rank}(\mathbb{X}) = (k+1)$, where $k+1$ is the number of columns of $\mathbb{X}$. Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and **linear independence** means that $\mathbb{X}\mb{b} = 0$ if and only if $\mb{b}$ is a column vector of 0s. In other words, we have
$$
b_{1}\mathbb{X}_{1} + b_{2}\mathbb{X}_{2} + \cdots + b_{k+1}\mathbb{X}_{k+1} = 0 \quad\iff\quad b_{1} = b_{2} = \cdots = b_{k+1} = 0,
$$
@@ -299,7 +299,7 @@ $$
$$
In this case, this expression equals 0 when $b_3 = b_4 = \cdots = b_{k+1} = 0$ and $b_1 = -2b_2$. Thus, the collection of columns is linearly dependent, so we know that the rank of $\mathbb{X}$ must be less than full column rank (that is, less than $k+1$). Hopefully, it is also clear that if we removed the problematic column $\mathbb{X}_2$, the resulting matrix would have $k$ linearly independent columns, implying that $\mathbb{X}$ is rank $k$.

Why does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that $\Xmat$ if of full column rank if and only if $\Xmat'\Xmat$ is non-singular and a matrix is invertible if and only if it is non-singular. Thus, the columns of $\Xmat$ being linearly independent means that the inverse $(\Xmat'\Xmat)^{-1}$ exists and so does $\bhat$. Further, this full rank condition also implies that $\Xmat'\Xmat = \sum_{i=1}^{n}\X_{i}\X_{i}'$ is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals.
Why does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that $\Xmat$ is of full column rank if and only if $\Xmat'\Xmat$ is non-singular and a matrix is invertible if and only if it is non-singular. Thus, the columns of $\Xmat$ being linearly independent means that the inverse $(\Xmat'\Xmat)^{-1}$ exists and so does $\bhat$. Further, this full rank condition also implies that $\Xmat'\Xmat = \sum_{i=1}^{n}\X_{i}\X_{i}'$ is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals.
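
As a numerical sketch of this point (the design matrix below is made up, not from the text), a linearly dependent column leaves $\Xmat$ short of full column rank, so $\Xmat'\Xmat$ is singular and the OLS inverse does not exist until the redundant column is dropped:

```{r}
# Hypothetical design matrix whose third column is twice its second column
set.seed(2)
x <- rnorm(6)
Xmat <- cbind(1, x, 2 * x)

qr(Xmat)$rank                 # rank 2 < 3 columns: not full column rank
# solve(crossprod(Xmat))      # would error: X'X is singular, so no OLS solution

qr(Xmat[, 1:2])$rank          # dropping the redundant column restores full rank
```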

What are common situations that lead to violations of no multicollinearity? We have seen one above, with one variable being a linear function of another. But this problem can come out in more subtle ways. Suppose that we have a set of dummy variables corresponding to a single categorical variable, like the region of the country. In the US, this might mean we have $X_{i1} = 1$ for units in the West (0 otherwise), $X_{i2} = 1$ for units in the Midwest (0 otherwise), $X_{i3} = 1$ for units in the South (0 otherwise), and $X_{i4} = 1$ for units in the Northeast (0 otherwise). Each unit has to be in one of these four regions, so there is a linear dependence between these variables,
$$
@@ -333,7 +333,7 @@ Note that these interpretations only hold when the regression consists solely of

OLS has a very nice geometric interpretation that can add a lot of intuition for various aspects of the method. In this geometric approach, we view $\mb{Y}$ as an $n$-dimensional vector in $\mathbb{R}^n$. As we saw above, OLS in matrix form is about finding a linear combination of the covariate matrix $\Xmat$ closest to this vector in terms of the Euclidean distance (which is just the sum of squares).

Let $\mathcal{C}(\Xmat) = \{\Xmat\mb{b} : \mb{b} \in \mathbb{R}^2\}$ be the **column space** of the matrix $\Xmat$. This set is all linear combinations of the columns of $\Xmat$ or the set of all possible linear predictions we could obtain from $\Xmat$. Notice that the OLS fitted values, $\Xmat\bhat$, are in this column space. If, as we assume, $\Xmat$ has full column rank of $k+1$, then the column space $\mathcal{C}(\Xmat)$ will be a $k+1$-dimensional surface inside of the larger $n$-dimensional space. If $\Xmat$ has two columns, the column space will be a plane.
Let $\mathcal{C}(\Xmat) = \{\Xmat\mb{b} : \mb{b} \in \mathbb{R}^(k+1)\}$ be the **column space** of the matrix $\Xmat$. This set is all linear combinations of the columns of $\Xmat$ or the set of all possible linear predictions we could obtain from $\Xmat$. Notice that the OLS fitted values, $\Xmat\bhat$, are in this column space. If, as we assume, $\Xmat$ has full column rank of $k+1$, then the column space $\mathcal{C}(\Xmat)$ will be a $k+1$-dimensional surface inside of the larger $n$-dimensional space. If $\Xmat$ has two columns, the column space will be a plane.

Another interpretation of the OLS estimator is that it finds the linear predictor as the closest point in the column space of $\Xmat$ to the outcome vector $\mb{Y}$. This is called the **projection** of $\mb{Y}$ onto $\mathcal{C}(\Xmat)$. @fig-projection shows this projection for a case with $n=3$ and 2 columns in $\Xmat$. The shaded blue region represents the plane of the column space of $\Xmat$, and we can see that $\Xmat\bhat$ is the closest point to $\mb{Y}$ in that space. That's the whole idea of the OLS estimator: find the linear combination of the columns of $\Xmat$ (a point in the column space) that minimizes the Euclidean distance between that point and the outcome vector (the sum of squared residuals).
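
This projection interpretation is easy to verify numerically. Below is a small sketch (simulated data, hypothetical names) showing that the OLS fitted values from `lm()` equal the projection $\Xmat(\Xmat'\Xmat)^{-1}\Xmat'\mb{Y}$ of the outcome vector onto $\mathcal{C}(\Xmat)$:

```{r}
# Hypothetical check: OLS fitted values are the projection of Y onto C(X)
set.seed(3)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
Xmat <- cbind(1, x)

P <- Xmat %*% solve(crossprod(Xmat)) %*% t(Xmat)   # projection matrix onto C(X)
max(abs(P %*% y - fitted(lm(y ~ x))))              # essentially zero
```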

@@ -431,7 +431,7 @@ The residual regression approach is:

1. Use OLS to regress $\mb{Y}$ on $\Xmat_2$ and obtain residuals $\widetilde{\mb{e}}_2$.
2. Use OLS to regress each column of $\Xmat_1$ on $\Xmat_2$ and obtain residuals $\widetilde{\Xmat}_1$.
3. Use OLS to regression $\widetilde{\mb{e}}_{2}$ on $\widetilde{\Xmat}_1$.
3. Use OLS to regress $\widetilde{\mb{e}}_{2}$ on $\widetilde{\Xmat}_1$.

:::
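
The three residual regression steps above can be checked with a short simulation (all names and data here are hypothetical): the coefficient from step 3 matches the coefficient on the same covariate in the full regression.

```{r}
# Hypothetical check of the residual (Frisch-Waugh-Lovell) regression steps
set.seed(4)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n) + 0.5 * x1
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

e2  <- resid(lm(y ~ x2))        # step 1: residualize Y on X_2
x1t <- resid(lm(x1 ~ x2))       # step 2: residualize X_1 on X_2
coef(lm(e2 ~ x1t))["x1t"]       # step 3: residual-on-residual coefficient
coef(lm(y ~ x1 + x2))["x1"]     # same as the full-regression coefficient
```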

@@ -469,7 +469,7 @@ h_{ii} = \X_{i}'\left(\Xmat'\Xmat\right)^{-1}\X_{i},
$$
which is the $i$th diagonal entry of the projection matrix, $\mb{P}_{\Xmat}$. Notice that
$$
\widehat{\mb{Y}} = \mb{P}\mb{Y} \qquad \implies \qquad \widehat{Y}_i = \sum_{j=1}^n h_{ij}Y_j,
\widehat{\mb{Y}} = \mb{P}_{\Xmat}\mb{Y} \qquad \implies \qquad \widehat{Y}_i = \sum_{j=1}^n h_{ij}Y_j,
$$
so that $h_{ij}$ is the importance of observation $j$ for the fitted value for observation $i$. The leverage, then, is the importance of the observation for its own fitted value. We can also interpret these values in terms of the distribution of $\X_{i}$. Roughly speaking, these values are the weighted distance $\X_i$ is from $\overline{\X}$, where the weights normalize to the empirical variance/covariance structure of the covariates (so that the scale of each covariate is roughly the same). We can see this most clearly when we fit a simple linear regression (with one covariate and an intercept) with OLS when the leverage is
$$
@@ -545,7 +545,7 @@ text(5, 2, "Full sample", pos = 2, col = "dodgerblue")
text(7, 7, "Influence Point", pos = 1, col = "indianred")
```
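
As a sketch of how these leverage values can be computed in practice (the fitted model here is hypothetical), the diagonal of the projection matrix agrees with R's built-in `hatvalues()`, and the leverages sum to the number of coefficients:

```{r}
# Hypothetical illustration of leverage as the diagonal of the projection matrix
set.seed(5)
x <- rnorm(30)
y <- 1 + x + rnorm(30)
fit <- lm(y ~ x)

Xmat <- model.matrix(fit)
h <- diag(Xmat %*% solve(crossprod(Xmat)) %*% t(Xmat))

max(abs(h - hatvalues(fit)))   # essentially zero
sum(h)                         # equals k + 1 = 2 here
```
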
One measure of influence is called DFBETA$_i$ measures how much $i$ changes the estimated coefficient vector
One measure of influence, called DFBETA$_i$, measures how much $i$ changes the estimated coefficient vector
$$
\bhat - \bhat_{(-i)} = \left( \Xmat'\Xmat\right)^{-1}\X_i\widetilde{e}_i,
$$
22 changes: 11 additions & 11 deletions 08_ols_properties.qmd
@@ -6,7 +6,7 @@ In this chapter, we will focus first on the asymptotic properties of OLS because

## Large-sample properties of OLS

As we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: a consistent estimate of the variance of $\bhat$ and the approximate distribution of $\bhat$ in large samples. Remember that since $\bhat$ is a vector, then the variance of that estimator will actually be a variance-covariance matrix. To obtain these two ingredients, we will first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which will include its variance.
As we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: a consistent estimate of the variance of $\bhat$ and the approximate distribution of $\bhat$ in large samples. Remember that since $\bhat$ is a vector, the variance of that estimator will actually be a variance-covariance matrix. To obtain these two ingredients, we will first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which will include its variance.


We begin by setting out the assumptions we will need for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\bhat = \E[\X_{i}\X_{i}']^{-1}\E[\X_{i}Y_{i}]$, is well-defined and unique.
@@ -17,9 +17,9 @@ We begin by setting out the assumptions we will need for establishing the large-

The linear projection model makes the following assumptions:

1. $\{(Y_{i}, \X_{i})\}_{i=1}^n$ are iid random vectors.
1. $\{(Y_{i}, \X_{i})\}_{i=1}^n$ are iid random vectors

2. $\E[Y_{i}^{2}] < \infty$ (finite outcome variance)
2. $\E[Y^{2}_{i}] < \infty$ (finite outcome variance)

3. $\E[\Vert \X_{i}\Vert^{2}] < \infty$ (finite variances and covariances of covariates)

@@ -40,7 +40,7 @@ $$
$$
which implies that
$$
\bhat \inprob \beta + \mb{Q}_{\X\X}^{-1}\E[\X_ie_i] = \beta,
\bhat \inprob \bfbeta + \mb{Q}_{\X\X}^{-1}\E[\X_ie_i] = \bfbeta,
$$
by the continuous mapping theorem (the inverse is a continuous function). The linear projection assumptions ensure that LLN applies to these sample means and ensure that $\E[\X_{i}\X_{i}']$ is invertible.
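
A brief simulation (with made-up parameter values) illustrates this consistency result: as $n$ grows, the OLS estimates concentrate around the population coefficients.

```{r}
# Hypothetical simulation of OLS consistency
set.seed(6)
ols_at_n <- function(n) {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)
  coef(lm(y ~ x))
}

# Columns move toward the population values (1, 2) as n increases
sapply(c(100, 10000, 1000000), ols_at_n)
```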

@@ -281,7 +281,7 @@ If $\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$
$$
t = \frac{\widehat{\beta}_{j} - \beta_{j}}{\widehat{\se}[\widehat{\beta}_{j}]} \indist \N(0,1)
$$
so $t^2$ will converge in distribution to a $\chi^2_1$ (since a $\chi^2_1$ is just one standard normal squared). After recentering ad rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\mb{L}\bhat = \mb{c}$, we have $W \indist \chi^2_{q}$.
so $t^2$ will converge in distribution to a $\chi^2_1$ (since a $\chi^2_1$ is just one standard normal squared). After recentering and rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\mb{L}\bhat = \mb{c}$, we have $W \indist \chi^2_{q}$.
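
As a sketch (simulated data; the null hypothesis tested is invented for illustration), the Wald statistic for a joint restriction $\mb{L}\bfbeta = \mb{c}$ can be computed directly and compared with the $\chi^2_q$ critical value:

```{r}
# Hypothetical Wald test of the joint null beta_1 = 0 and beta_2 = 0
set.seed(7)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.3 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

L <- rbind(c(0, 1, 0),           # restriction on beta_1
           c(0, 0, 1))           # restriction on beta_2
r <- L %*% coef(fit) - c(0, 0)
V <- vcov(fit)                   # variance estimate (homoskedastic, for illustration)

W <- t(r) %*% solve(L %*% V %*% t(L)) %*% r
W
qchisq(0.95, df = nrow(L))       # chi-squared critical value with q = 2
```
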
::: {.callout-note}
@@ -302,7 +302,7 @@ The Wald statistic is not a common test provided by standard statistical softwar
$$
F = \frac{W}{q},
$$
which also typically uses the the homoskedastic variance estimator $\mb{V}^{\texttt{lm}}_{\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution is not really statistically justified, it is slightly more conservative than the $\chi^2_q$ distribution, and the inference will converge as $n\to\infty$. So it might be justified as an *ad hoc* small sample adjustment to the Wald test. For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and say we have a sample size of $n = 100$. In that case, the critical value for the F test with $\alpha = 0.05$ is
which also typically uses the homoskedastic variance estimator $\mb{V}^{\texttt{lm}}_{\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution is not really statistically justified, it is slightly more conservative than the $\chi^2_q$ distribution, and the inference will converge as $n\to\infty$. So it might be justified as an *ad hoc* small sample adjustment to the Wald test. For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and say we have a sample size of $n = 100$. In that case, the critical value for the F test with $\alpha = 0.05$ is
```{r}
qf(0.95, df1 = 2, df2 = 100 - 4)
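# For comparison (an illustrative aside, not in the original text): the
# corresponding chi-squared-based cutoff for W / q is slightly smaller, so the
# F distribution is a bit more conservative here
qchisq(0.95, df = 2) / 2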
@@ -343,7 +343,7 @@ Under the linear regression model assumption, OLS is unbiased for the population
$$
\E[\bhat \mid \Xmat] = \bfbeta,
$$
and its conditional sampling variance issue
and its conditional sampling variance is
$$
\mb{\V}_{\bhat} = \V[\bhat \mid \Xmat] = \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2_i \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1},
$$
@@ -396,7 +396,7 @@ where $\overset{a}{\sim}$ means approximately asymptotically distributed as. Und
$$
\mb{V}_{\bhat} = \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2_i \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \approx \mb{V}_{\bfbeta} / n
$$
In practice, these two derivations lead to basically the same variance estimator. Recall the heteroskedastic-consistent variance estimator is
In practice, these two derivations lead to basically the same variance estimator. Recall the heteroskedastic-consistent variance estimator
$$
\widehat{\mb{V}}_{\bfbeta} = \left( \frac{1}{n} \Xmat'\Xmat \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^n\widehat{e}_i^2\X_i\X_i' \right) \left( \frac{1}{n} \Xmat'\Xmat \right)^{-1},
$$
@@ -437,7 +437,7 @@ is unbiased, $\E[\widehat{\mb{V}}^{\texttt{lm}}_{\bhat} \mid \Xmat] = \mb{V}^{\t
:::
::: {.proof}
Under homoskedasticity $\sigma^2_i = \sigma^2$ for all $i$. Recall that $\sum_{i=1}^n \X_i\X_i' = \Xmat'\Xmat$ Thus, the conditional sampling variance from @thm-ols-unbiased,
Under homoskedasticity $\sigma^2_i = \sigma^2$ for all $i$. Recall that $\sum_{i=1}^n \X_i\X_i' = \Xmat'\Xmat$. Thus, the conditional sampling variance from @thm-ols-unbiased,
$$
\begin{aligned}
\V[\bhat \mid \Xmat] &= \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2 \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \\ &= \sigma^2\left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \\&= \sigma^2\left( \Xmat'\Xmat \right)^{-1}\left( \Xmat'\Xmat \right) \left( \Xmat'\Xmat \right)^{-1} \\&= \sigma^2\left( \Xmat'\Xmat \right)^{-1} = \mb{V}^{\texttt{lm}}_{\bhat}.
@@ -456,12 +456,12 @@ where the first equality is because $\mb{M}_{\Xmat} = \mb{I}_{n} - \Xmat (\Xmat'
$$
\V[\widehat{e}_i \mid \Xmat] = \E[\widehat{e}_{i}^{2} \mid \Xmat] = (1 - h_{ii})\sigma^{2}.
$$
In the last chapter, we established one property of these leverage values in @sec-leverage is that $\sum_{i=1}^n h_{ii} = k+ 1$, so $\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have
In the last chapter, we established one property of these leverage values in @sec-leverage, namely $\sum_{i=1}^n h_{ii} = k+ 1$, so $\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have
$$
\begin{aligned}
\E[\widehat{\sigma}^{2} \mid \Xmat] &= \frac{1}{n-k-1} \sum_{i=1}^{n} \E[\widehat{e}_{i}^{2} \mid \Xmat] \\
&= \frac{\sigma^{2}}{n-k-1} \sum_{i=1}^{n} 1 - h_{ii} \\
&= \sigma^{2}
&= \sigma^{2}.
\end{aligned}
$$
This establishes $\E[\widehat{\mb{V}}^{\texttt{lm}}_{\bhat} \mid \Xmat] = \mb{V}^{\texttt{lm}}_{\bhat}$.
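
A quick simulation (purely hypothetical values) is consistent with this unbiasedness result: averaging $\widehat{\sigma}^2$ over repeated samples with homoskedastic errors recovers the true error variance.

```{r}
# Hypothetical simulation: sigma-hat^2 is unbiased under homoskedasticity
set.seed(8)
sigma2 <- 4
x <- rnorm(100)                       # fixed design across simulations
sims <- replicate(2000, {
  y <- 1 + 2 * x + rnorm(100, sd = sqrt(sigma2))
  summary(lm(y ~ x))$sigma^2          # SSR / (n - k - 1)
})
mean(sims)                            # close to the true sigma^2 = 4
```
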
4 changes: 2 additions & 2 deletions _freeze/06_linear_model/execute-results/html.json

4 changes: 2 additions & 2 deletions _freeze/06_linear_model/execute-results/tex.json

Binary file modified _freeze/06_linear_model/figure-pdf/fig-blp-limits-1.pdf
Binary file modified _freeze/06_linear_model/figure-pdf/fig-cef-binned-1.pdf
Binary file modified _freeze/06_linear_model/figure-pdf/fig-cef-blp-1.pdf
4 changes: 2 additions & 2 deletions _freeze/07_least_squares/execute-results/html.json

4 changes: 2 additions & 2 deletions _freeze/07_least_squares/execute-results/tex.json

Binary file modified _freeze/07_least_squares/figure-pdf/fig-ajr-scatter-1.pdf
Binary file modified _freeze/07_least_squares/figure-pdf/fig-influence-1.pdf
Binary file modified _freeze/07_least_squares/figure-pdf/fig-outlier-1.pdf
Binary file modified _freeze/07_least_squares/figure-pdf/fig-ssr-comp-1.pdf
Binary file modified _freeze/07_least_squares/figure-pdf/fig-ssr-vs-tss-1.pdf
4 changes: 2 additions & 2 deletions _freeze/08_ols_properties/execute-results/html.json

4 changes: 2 additions & 2 deletions _freeze/08_ols_properties/execute-results/tex.json

Binary file modified _freeze/08_ols_properties/figure-pdf/fig-wald-1.pdf
