Commit

Merge pull request #2 from mattblackwell/main

merge

noahdasanaike authored Dec 4, 2023
2 parents fc6ead7 + 786c070 commit 6520d71
Showing 18 changed files with 30 additions and 30 deletions.
2 changes: 1 addition & 1 deletion 06_linear_model.qmd
@@ -418,7 +418,7 @@ Here, the coefficients are slightly more interpretable:

* $\beta_1$: the marginal effect of $X_{i1}$ on predicted $Y_i$ when $X_{i2} = 0$.
* $\beta_2$: the marginal effect of $X_{i2}$ on predicted $Y_i$ when $X_{i1} = 0$.
* $\beta_3$: the change in the marginal effect of $X_{i1}$ due to a one-unit change in $X_{i2}$ **OR** the change in the marginal effect of $X_{i2}$ due to a one-unit change in $X_{i2}$.
* $\beta_3$: the change in the marginal effect of $X_{i1}$ due to a one-unit change in $X_{i2}$ **OR** the change in the marginal effect of $X_{i2}$ due to a one-unit change in $X_{i1}$.

If we add more covariates to this BLP, these interpretations change to "holding all other covariates constant."
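
To make these interpretations concrete, here is a minimal sketch in R (the data and variable names are made up for illustration, not taken from the text): in an interacted regression, the estimated marginal effect of $X_{i1}$ shifts by the interaction coefficient for each one-unit change in $X_{i2}$.

```{r}
# Hypothetical data to illustrate the interaction interpretation
set.seed(1)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 1 * x2 + 0.5 * x1 * x2 + rnorm(n)

fit <- lm(y ~ x1 * x2)   # includes x1, x2, and x1:x2
b <- coef(fit)

b["x1"]                  # estimated marginal effect of x1 when x2 = 0 (beta_1)
b["x1"] + b["x1:x2"]     # estimated marginal effect of x1 when x2 = 1 (beta_1 + beta_3)
```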

12 changes: 6 additions & 6 deletions 07_least_squares.qmd
@@ -284,7 +284,7 @@ $$

## Rank, linear independence, and multicollinearity {#sec-rank}

When introducing the OLS estimator, we noted that it would exist when $\sum_{i=1}^n \X_i\X_i'$ is positive definite or that there is "no multicollinearity." This assumption is equivalent to saying that the matrix $\mathbb{X}$ is full column rank, meaning that $\text{rank}(\mathbb{X}) = (k+1)$, where $k+1$ is the number of columns of $\mathbb{X}$. Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and **linear independence** means that if $\mathbb{X}\mb{b} = 0$ if and only if $\mb{b}$ is a column vector of 0s. In other words, we have
When introducing the OLS estimator, we noted that it would exist when $\sum_{i=1}^n \X_i\X_i'$ is positive definite or that there is "no multicollinearity." This assumption is equivalent to saying that the matrix $\mathbb{X}$ is full column rank, meaning that $\text{rank}(\mathbb{X}) = (k+1)$, where $k+1$ is the number of columns of $\mathbb{X}$. Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and **linear independence** means that $\mathbb{X}\mb{b} = 0$ if and only if $\mb{b}$ is a column vector of 0s. In other words, we have
$$
b_{1}\mathbb{X}_{1} + b_{2}\mathbb{X}_{2} + \cdots + b_{k+1}\mathbb{X}_{k+1} = 0 \quad\iff\quad b_{1} = b_{2} = \cdots = b_{k+1} = 0,
$$
@@ -299,7 +299,7 @@ $$
$$
In this case, this expression equals 0 when $b_3 = b_4 = \cdots = b_{k+1} = 0$ and $b_1 = -2b_2$. Thus, the collection of columns is linearly dependent, so we know that the rank of $\mathbb{X}$ must be less than full column rank (that is, less than $k+1$). Hopefully, it is also clear that if we removed the problematic column $\mathbb{X}_2$, the resulting matrix would have $k$ linearly independent columns, implying that $\mathbb{X}$ is rank $k$.

Why does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that $\Xmat$ if of full column rank if and only if $\Xmat'\Xmat$ is non-singular and a matrix is invertible if and only if it is non-singular. Thus, the columns of $\Xmat$ being linearly independent means that the inverse $(\Xmat'\Xmat)^{-1}$ exists and so does $\bhat$. Further, this full rank condition also implies that $\Xmat'\Xmat = \sum_{i=1}^{n}\X_{i}\X_{i}'$ is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals.
Why does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that $\Xmat$ is of full column rank if and only if $\Xmat'\Xmat$ is non-singular and a matrix is invertible if and only if it is non-singular. Thus, the columns of $\Xmat$ being linearly independent means that the inverse $(\Xmat'\Xmat)^{-1}$ exists and so does $\bhat$. Further, this full rank condition also implies that $\Xmat'\Xmat = \sum_{i=1}^{n}\X_{i}\X_{i}'$ is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals.
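
As a numerical sketch of this point (the design matrix below is made up, not from the text), a linearly dependent column leaves $\Xmat$ short of full column rank, so $\Xmat'\Xmat$ is singular and the OLS inverse does not exist until the redundant column is dropped:

```{r}
# Hypothetical design matrix whose third column is twice its second column
set.seed(2)
x <- rnorm(6)
Xmat <- cbind(1, x, 2 * x)

qr(Xmat)$rank                 # rank 2 < 3 columns: not full column rank
# solve(crossprod(Xmat))      # would error: X'X is singular, so no OLS solution

qr(Xmat[, 1:2])$rank          # dropping the redundant column restores full rank
```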

What are common situations that lead to violations of no multicollinearity? We have seen one above, with one variable being a linear function of another. But this problem can come out in more subtle ways. Suppose that we have a set of dummy variables corresponding to a single categorical variable, like the region of the country. In the US, this might mean we have $X_{i1} = 1$ for units in the West (0 otherwise), $X_{i2} = 1$ for units in the Midwest (0 otherwise), $X_{i3} = 1$ for units in the South (0 otherwise), and $X_{i4} = 1$ for units in the Northeast (0 otherwise). Each unit has to be in one of these four regions, so there is a linear dependence between these variables,
$$
@@ -333,7 +333,7 @@ Note that these interpretations only hold when the regression consists solely of

OLS has a very nice geometric interpretation that can add a lot of intuition for various aspects of the method. In this geometric approach, we view $\mb{Y}$ as an $n$-dimensional vector in $\mathbb{R}^n$. As we saw above, OLS in matrix form is about finding a linear combination of the covariate matrix $\Xmat$ closest to this vector in terms of the Euclidean distance (which is just the sum of squares).

Let $\mathcal{C}(\Xmat) = \{\Xmat\mb{b} : \mb{b} \in \mathbb{R}^2\}$ be the **column space** of the matrix $\Xmat$. This set is all linear combinations of the columns of $\Xmat$ or the set of all possible linear predictions we could obtain from $\Xmat$. Notice that the OLS fitted values, $\Xmat\bhat$, are in this column space. If, as we assume, $\Xmat$ has full column rank of $k+1$, then the column space $\mathcal{C}(\Xmat)$ will be a $k+1$-dimensional surface inside of the larger $n$-dimensional space. If $\Xmat$ has two columns, the column space will be a plane.
Let $\mathcal{C}(\Xmat) = \{\Xmat\mb{b} : \mb{b} \in \mathbb{R}^(k+1)\}$ be the **column space** of the matrix $\Xmat$. This set is all linear combinations of the columns of $\Xmat$ or the set of all possible linear predictions we could obtain from $\Xmat$. Notice that the OLS fitted values, $\Xmat\bhat$, are in this column space. If, as we assume, $\Xmat$ has full column rank of $k+1$, then the column space $\mathcal{C}(\Xmat)$ will be a $k+1$-dimensional surface inside of the larger $n$-dimensional space. If $\Xmat$ has two columns, the column space will be a plane.

Another interpretation of the OLS estimator is that it finds the linear predictor as the closest point in the column space of $\Xmat$ to the outcome vector $\mb{Y}$. This is called the **projection** of $\mb{Y}$ onto $\mathcal{C}(\Xmat)$. @fig-projection shows this projection for a case with $n=3$ and 2 columns in $\Xmat$. The shaded blue region represents the plane of the column space of $\Xmat$, and we can see that $\Xmat\bhat$ is the closest point to $\mb{Y}$ in that space. That's the whole idea of the OLS estimator: find the linear combination of the columns of $\Xmat$ (a point in the column space) that minimizes the Euclidean distance between that point and the outcome vector (the sum of squared residuals).
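
This projection interpretation is easy to verify numerically. Below is a small sketch (simulated data, hypothetical names) showing that the OLS fitted values from `lm()` equal the projection $\Xmat(\Xmat'\Xmat)^{-1}\Xmat'\mb{Y}$ of the outcome vector onto $\mathcal{C}(\Xmat)$:

```{r}
# Hypothetical check: OLS fitted values are the projection of Y onto C(X)
set.seed(3)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
Xmat <- cbind(1, x)

P <- Xmat %*% solve(crossprod(Xmat)) %*% t(Xmat)   # projection matrix onto C(X)
max(abs(P %*% y - fitted(lm(y ~ x))))              # essentially zero
```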

@@ -431,7 +431,7 @@ The residual regression approach is:

1. Use OLS to regress $\mb{Y}$ on $\Xmat_2$ and obtain residuals $\widetilde{\mb{e}}_2$.
2. Use OLS to regress each column of $\Xmat_1$ on $\Xmat_2$ and obtain residuals $\widetilde{\Xmat}_1$.
3. Use OLS to regression $\widetilde{\mb{e}}_{2}$ on $\widetilde{\Xmat}_1$.
3. Use OLS to regress $\widetilde{\mb{e}}_{2}$ on $\widetilde{\Xmat}_1$.

:::
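
The three residual regression steps above can be checked with a short simulation (all names and data here are hypothetical): the coefficient from step 3 matches the coefficient on the same covariate in the full regression.

```{r}
# Hypothetical check of the residual (Frisch-Waugh-Lovell) regression steps
set.seed(4)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n) + 0.5 * x1
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

e2  <- resid(lm(y ~ x2))        # step 1: residualize Y on X_2
x1t <- resid(lm(x1 ~ x2))       # step 2: residualize X_1 on X_2
coef(lm(e2 ~ x1t))["x1t"]       # step 3: residual-on-residual coefficient
coef(lm(y ~ x1 + x2))["x1"]     # same as the full-regression coefficient
```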

@@ -469,7 +469,7 @@ h_{ii} = \X_{i}'\left(\Xmat'\Xmat\right)^{-1}\X_{i},
$$
which is the $i$th diagonal entry of the projection matrix, $\mb{P}_{\Xmat}$. Notice that
$$
\widehat{\mb{Y}} = \mb{P}\mb{Y} \qquad \implies \qquad \widehat{Y}_i = \sum_{j=1}^n h_{ij}Y_j,
\widehat{\mb{Y}} = \mb{P}_{\Xmat}\mb{Y} \qquad \implies \qquad \widehat{Y}_i = \sum_{j=1}^n h_{ij}Y_j,
$$
so that $h_{ij}$ is the importance of observation $j$ for the fitted value for observation $i$. The leverage, then, is the importance of the observation for its own fitted value. We can also interpret these values in terms of the distribution of $\X_{i}$. Roughly speaking, these values are the weighted distance $\X_i$ is from $\overline{\X}$, where the weights normalize to the empirical variance/covariance structure of the covariates (so that the scale of each covariate is roughly the same). We can see this most clearly when we fit a simple linear regression (with one covariate and an intercept) with OLS when the leverage is
$$
@@ -545,7 +545,7 @@ text(5, 2, "Full sample", pos = 2, col = "dodgerblue")
text(7, 7, "Influence Point", pos = 1, col = "indianred")
```
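
As a sketch of how these leverage values can be computed in practice (the fitted model here is hypothetical), the diagonal of the projection matrix agrees with R's built-in `hatvalues()`, and the leverages sum to the number of coefficients:

```{r}
# Hypothetical illustration of leverage as the diagonal of the projection matrix
set.seed(5)
x <- rnorm(30)
y <- 1 + x + rnorm(30)
fit <- lm(y ~ x)

Xmat <- model.matrix(fit)
h <- diag(Xmat %*% solve(crossprod(Xmat)) %*% t(Xmat))

max(abs(h - hatvalues(fit)))   # essentially zero
sum(h)                         # equals k + 1 = 2 here
```
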
One measure of influence is called DFBETA$_i$ measures how much $i$ changes the estimated coefficient vector
One measure of influence, called DFBETA$_i$, measures how much $i$ changes the estimated coefficient vector
$$
\bhat - \bhat_{(-i)} = \left( \Xmat'\Xmat\right)^{-1}\X_i\widetilde{e}_i,
$$
22 changes: 11 additions & 11 deletions 08_ols_properties.qmd
@@ -6,7 +6,7 @@ In this chapter, we will focus first on the asymptotic properties of OLS because

## Large-sample properties of OLS

As we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: a consistent estimate of the variance of $\bhat$ and the approximate distribution of $\bhat$ in large samples. Remember that since $\bhat$ is a vector, then the variance of that estimator will actually be a variance-covariance matrix. To obtain these two ingredients, we will first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which will include its variance.
As we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: a consistent estimate of the variance of $\bhat$ and the approximate distribution of $\bhat$ in large samples. Remember that since $\bhat$ is a vector, the variance of that estimator will actually be a variance-covariance matrix. To obtain these two ingredients, we will first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which will include its variance.


We begin by setting out the assumptions we will need for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\bhat = \E[\X_{i}\X_{i}']^{-1}\E[\X_{i}Y_{i}]$, is well-defined and unique.
@@ -17,9 +17,9 @@ We begin by setting out the assumptions we will need for establishing the large-

The linear projection model makes the following assumptions:

1. $\{(Y_{i}, \X_{i})\}_{i=1}^n$ are iid random vectors.
1. $\{(Y_{i}, \X_{i})\}_{i=1}^n$ are iid random vectors

2. $\E[Y_{i}^{2}] < \infty$ (finite outcome variance)
2. $\E[Y^{2}_{i}] < \infty$ (finite outcome variance)

3. $\E[\Vert \X_{i}\Vert^{2}] < \infty$ (finite variances and covariances of covariates)

@@ -40,7 +40,7 @@ $$
$$
which implies that
$$
\bhat \inprob \beta + \mb{Q}_{\X\X}^{-1}\E[\X_ie_i] = \beta,
\bhat \inprob \bfbeta + \mb{Q}_{\X\X}^{-1}\E[\X_ie_i] = \bfbeta,
$$
by the continuous mapping theorem (the inverse is a continuous function). The linear projection assumptions ensure that LLN applies to these sample means and ensure that $\E[\X_{i}\X_{i}']$ is invertible.
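
A brief simulation (with made-up parameter values) illustrates this consistency result: as $n$ grows, the OLS estimates concentrate around the population coefficients.

```{r}
# Hypothetical simulation of OLS consistency
set.seed(6)
ols_at_n <- function(n) {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)
  coef(lm(y ~ x))
}

# Columns move toward the population values (1, 2) as n increases
sapply(c(100, 10000, 1000000), ols_at_n)
```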

@@ -281,7 +281,7 @@ If $\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$
$$
t = \frac{\widehat{\beta}_{j} - \beta_{j}}{\widehat{\se}[\widehat{\beta}_{j}]} \indist \N(0,1)
$$
so $t^2$ will converge in distribution to a $\chi^2_1$ (since a $\chi^2_1$ is just one standard normal squared). After recentering ad rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\mb{L}\bhat = \mb{c}$, we have $W \indist \chi^2_{q}$.
so $t^2$ will converge in distribution to a $\chi^2_1$ (since a $\chi^2_1$ is just one standard normal squared). After recentering and rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\mb{L}\bhat = \mb{c}$, we have $W \indist \chi^2_{q}$.
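
As a sketch (simulated data; the null hypothesis tested is invented for illustration), the Wald statistic for a joint restriction $\mb{L}\bfbeta = \mb{c}$ can be computed directly and compared with the $\chi^2_q$ critical value:

```{r}
# Hypothetical Wald test of the joint null beta_1 = 0 and beta_2 = 0
set.seed(7)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.3 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

L <- rbind(c(0, 1, 0),           # restriction on beta_1
           c(0, 0, 1))           # restriction on beta_2
r <- L %*% coef(fit) - c(0, 0)
V <- vcov(fit)                   # variance estimate (homoskedastic, for illustration)

W <- t(r) %*% solve(L %*% V %*% t(L)) %*% r
W
qchisq(0.95, df = nrow(L))       # chi-squared critical value with q = 2
```
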
::: {.callout-note}
@@ -302,7 +302,7 @@ The Wald statistic is not a common test provided by standard statistical softwar
$$
F = \frac{W}{q},
$$
which also typically uses the the homoskedastic variance estimator $\mb{V}^{\texttt{lm}}_{\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution is not really statistically justified, it is slightly more conservative than the $\chi^2_q$ distribution, and the inference will converge as $n\to\infty$. So it might be justified as an *ad hoc* small sample adjustment to the Wald test. For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and say we have a sample size of $n = 100$. In that case, the critical value for the F test with $\alpha = 0.05$ is
which also typically uses the homoskedastic variance estimator $\mb{V}^{\texttt{lm}}_{\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution is not really statistically justified, it is slightly more conservative than the $\chi^2_q$ distribution, and the inference will converge as $n\to\infty$. So it might be justified as an *ad hoc* small sample adjustment to the Wald test. For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and say we have a sample size of $n = 100$. In that case, the critical value for the F test with $\alpha = 0.05$ is
```{r}
qf(0.95, df1 = 2, df2 = 100 - 4)
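# For comparison (an illustrative aside, not in the original text): the
# corresponding chi-squared-based cutoff for W / q is slightly smaller, so the
# F distribution is a bit more conservative here
qchisq(0.95, df = 2) / 2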
@@ -343,7 +343,7 @@ Under the linear regression model assumption, OLS is unbiased for the population
$$
\E[\bhat \mid \Xmat] = \bfbeta,
$$
and its conditional sampling variance issue
and its conditional sampling variance is
$$
\mb{\V}_{\bhat} = \V[\bhat \mid \Xmat] = \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2_i \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1},
$$
@@ -396,7 +396,7 @@ where $\overset{a}{\sim}$ means approximately asymptotically distributed as. Und
$$
\mb{V}_{\bhat} = \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2_i \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \approx \mb{V}_{\bfbeta} / n
$$
In practice, these two derivations lead to basically the same variance estimator. Recall the heteroskedastic-consistent variance estimator is
In practice, these two derivations lead to basically the same variance estimator. Recall the heteroskedastic-consistent variance estimator
$$
\widehat{\mb{V}}_{\bfbeta} = \left( \frac{1}{n} \Xmat'\Xmat \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^n\widehat{e}_i^2\X_i\X_i' \right) \left( \frac{1}{n} \Xmat'\Xmat \right)^{-1},
$$
@@ -437,7 +437,7 @@ is unbiased, $\E[\widehat{\mb{V}}^{\texttt{lm}}_{\bhat} \mid \Xmat] = \mb{V}^{\t
:::
::: {.proof}
Under homoskedasticity $\sigma^2_i = \sigma^2$ for all $i$. Recall that $\sum_{i=1}^n \X_i\X_i' = \Xmat'\Xmat$ Thus, the conditional sampling variance from @thm-ols-unbiased,
Under homoskedasticity $\sigma^2_i = \sigma^2$ for all $i$. Recall that $\sum_{i=1}^n \X_i\X_i' = \Xmat'\Xmat$. Thus, the conditional sampling variance from @thm-ols-unbiased,
$$
\begin{aligned}
\V[\bhat \mid \Xmat] &= \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2 \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \\ &= \sigma^2\left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \\&= \sigma^2\left( \Xmat'\Xmat \right)^{-1}\left( \Xmat'\Xmat \right) \left( \Xmat'\Xmat \right)^{-1} \\&= \sigma^2\left( \Xmat'\Xmat \right)^{-1} = \mb{V}^{\texttt{lm}}_{\bhat}.
@@ -456,12 +456,12 @@ where the first equality is because $\mb{M}_{\Xmat} = \mb{I}_{n} - \Xmat (\Xmat'
$$
\V[\widehat{e}_i \mid \Xmat] = \E[\widehat{e}_{i}^{2} \mid \Xmat] = (1 - h_{ii})\sigma^{2}.
$$
In the last chapter, we established one property of these leverage values in @sec-leverage is that $\sum_{i=1}^n h_{ii} = k+ 1$, so $\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have
In the last chapter, we established one property of these leverage values in @sec-leverage, namely $\sum_{i=1}^n h_{ii} = k+ 1$, so $\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have
$$
\begin{aligned}
\E[\widehat{\sigma}^{2} \mid \Xmat] &= \frac{1}{n-k-1} \sum_{i=1}^{n} \E[\widehat{e}_{i}^{2} \mid \Xmat] \\
&= \frac{\sigma^{2}}{n-k-1} \sum_{i=1}^{n} 1 - h_{ii} \\
&= \sigma^{2}
&= \sigma^{2}.
\end{aligned}
$$
This establishes $\E[\widehat{\mb{V}}^{\texttt{lm}}_{\bhat} \mid \Xmat] = \mb{V}^{\texttt{lm}}_{\bhat}$.
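
A quick simulation (purely hypothetical values) is consistent with this unbiasedness result: averaging $\widehat{\sigma}^2$ over repeated samples with homoskedastic errors recovers the true error variance.

```{r}
# Hypothetical simulation: sigma-hat^2 is unbiased under homoskedasticity
set.seed(8)
sigma2 <- 4
x <- rnorm(100)                       # fixed design across simulations
sims <- replicate(2000, {
  y <- 1 + 2 * x + rnorm(100, sd = sqrt(sigma2))
  summary(lm(y ~ x))$sigma^2          # SSR / (n - k - 1)
})
mean(sims)                            # close to the true sigma^2 = 4
```
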
4 changes: 2 additions & 2 deletions _freeze/06_linear_model/execute-results/html.json

4 changes: 2 additions & 2 deletions _freeze/06_linear_model/execute-results/tex.json

Binary file modified _freeze/06_linear_model/figure-pdf/fig-blp-limits-1.pdf
Binary file modified _freeze/06_linear_model/figure-pdf/fig-cef-binned-1.pdf
Binary file modified _freeze/06_linear_model/figure-pdf/fig-cef-blp-1.pdf
4 changes: 2 additions & 2 deletions _freeze/07_least_squares/execute-results/html.json

4 changes: 2 additions & 2 deletions _freeze/07_least_squares/execute-results/tex.json

Binary file modified _freeze/07_least_squares/figure-pdf/fig-ajr-scatter-1.pdf
Binary file modified _freeze/07_least_squares/figure-pdf/fig-influence-1.pdf
Binary file modified _freeze/07_least_squares/figure-pdf/fig-outlier-1.pdf
Binary file modified _freeze/07_least_squares/figure-pdf/fig-ssr-comp-1.pdf
Binary file modified _freeze/07_least_squares/figure-pdf/fig-ssr-vs-tss-1.pdf
4 changes: 2 additions & 2 deletions _freeze/08_ols_properties/execute-results/html.json

4 changes: 2 additions & 2 deletions _freeze/08_ols_properties/execute-results/tex.json

Binary file modified _freeze/08_ols_properties/figure-pdf/fig-wald-1.pdf
