From ed812d34ae19dcd835d8b12ac89105c6d93ff426 Mon Sep 17 00:00:00 2001 From: Matthew Blackwell Date: Thu, 25 Jul 2024 10:24:10 -0400 Subject: [PATCH] transpose typo (fixes #66) --- .../ols_properties/execute-results/html.json | 5 +- .../ols_properties/execute-results/tex.json | 5 +- .../ols_properties/figure-pdf/fig-wald-1.pdf | Bin 59497 -> 59497 bytes ols_properties.qmd | 8 +- users-guide.tex | 683 ++++++++++-------- 5 files changed, 397 insertions(+), 304 deletions(-) diff --git a/_freeze/ols_properties/execute-results/html.json b/_freeze/ols_properties/execute-results/html.json index 19cf150..38bfd19 100644 --- a/_freeze/ols_properties/execute-results/html.json +++ b/_freeze/ols_properties/execute-results/html.json @@ -1,8 +1,7 @@ { - "hash": "f6ce8f5973eb6b0086e9e94e9fa87e6a", + "hash": "9969498f97d6e5e726d77c03cffc4a2a", "result": { - "engine": "knitr", - "markdown": "\n\n# The statistics of least squares {#sec-ols-statistics}\n\nThe last chapter showcased the least squares estimator and investigated many of its more mechanical properties, which are essential for the practical application of OLS. But we still need to understand its statistical properties, as we discussed in Part I of this book: unbiasedness, sampling variance, consistency, and asymptotic normality. As we saw then, these properties fall into finite-sample (unbiasedness, sampling variance) and asymptotic (consistency, asymptotic normality). \n\nIn this chapter, we will focus on the asymptotic properties of OLS because those properties hold under the relatively mild conditions of the linear projection model introduced in @sec-linear-projection. We will see that OLS consistently estimates a coherent quantity of interest (the best linear predictor) regardless of whether the conditional expectation is linear. That is, for the asymptotic properties of the estimator, we will not need the commonly invoked linearity assumption. Later, when we investigate the finite-sample properties, we will show how linearity will help us establish unbiasedness and also how the normality of the errors can allow us to conduct exact, finite-sample inference. But these assumptions are very strong, so understanding what we can say about OLS without them is vital. \n\n## Large-sample properties of OLS\n\nAs we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: (1) a consistent estimate of the variance of $\\bhat$ and (2) the approximate distribution of $\\bhat$ in large samples. Remember that, since $\\bhat$ is a vector, the variance of that estimator will actually be a variance-covariance matrix. To obtain the two key ingredients, we first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which includes its variance. \n\n\nWe begin by setting out the assumptions needed for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\\bfbeta = \\E[\\X_{i}\\X_{i}']^{-1}\\E[\\X_{i}Y_{i}]$, is well-defined and unique. \n\n::: {.callout-note}\n\n### Linear projection assumptions\n\nThe linear projection model makes the following assumptions:\n\n1. $\\{(Y_{i}, \\X_{i})\\}_{i=1}^n$ are iid random vectors\n\n2. $\\E[Y^{2}_{i}] < \\infty$ (finite outcome variance)\n\n3. $\\E[\\Vert \\X_{i}\\Vert^{2}] < \\infty$ (finite variances and covariances of covariates)\n\n2. 
$\\E[\\X_{i}\\X_{i}']$ is positive definite (no linear dependence in the covariates)\n:::\n\n\nRecall that these are mild conditions on the joint distribution of $(Y_{i}, \\X_{i})$ and in particular, we are **not** assuming linearity of the CEF, $\\E[Y_{i} \\mid \\X_{i}]$, nor are we assuming any specific distribution for the data. \n\nWe can helpfully decompose the OLS estimator into the actual BLP coefficient plus estimation error as\n$$ \n\\bhat = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_iY_i \\right) = \\bfbeta + \\underbrace{\\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\right)}_{\\text{estimation error}}.\n$$ \n \nThis decomposition will help us quickly establish the consistency of $\\bhat$. By the law of large numbers, we know that sample means will converge in probability to population expectations, so we have\n$$ \n\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\inprob \\E[\\X_i\\X_i'] \\equiv \\mb{Q}_{\\X\\X} \\qquad \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\inprob \\E[\\X_{i} e_{i}] = \\mb{0},\n$$\nwhich implies by the continuous mapping theorem (the inverse is a continuous function) that \n$$\n\\bhat \\inprob \\bfbeta + \\mb{Q}_{\\X\\X}^{-1}\\E[\\X_ie_i] = \\bfbeta,\n$$\nThe linear projection assumptions ensure that the LLN applies to these sample means and that $\\E[\\X_{i}\\X_{i}']$ is invertible. \n\n\n::: {#thm-ols-consistency}\nUnder the above linear projection assumptions, the OLS estimator is consistent for the best linear projection coefficients, $\\bhat \\inprob \\bfbeta$.\n:::\n\nThus, OLS should be close to the population linear regression in large samples under relatively mild conditions. Remember that this may not equal the conditional expectation if the CEF is nonlinear. What we can say is that OLS converges to the best *linear* approximation to the CEF. Of course, this also means that, if the CEF is linear, then OLS will consistently estimate the coefficients of the CEF. \n\nTo emphasize, the only assumptions made about the dependent variable are that it (1) has finite variance and (2) is iid. Under this assumption, the outcome could be continuous, categorical, binary, or event count. \n\n\nNext, we would like to establish an asymptotic normality result for the OLS coefficients. We first review some key ideas about the Central Limit Theorem.\n\n::: {.callout-note}\n\n## CLT reminder\n\nSuppose that we have a function of the data iid random vectors $\\X_1, \\ldots, \\X_n$, $g(\\X_{i})$ where $\\E[g(\\X_{i})] = 0$ and so $\\V[g(\\X_{i})] = \\E[g(\\X_{i})g(\\X_{i})']$. Then if $\\E[\\Vert g(\\X_{i})\\Vert^{2}] < \\infty$, the CLT implies that\n$$ \n\\sqrt{n}\\left(\\frac{1}{n} \\sum_{i=1}^{n} g(\\X_{i}) - \\E[g(\\X_{i})]\\right) = \\frac{1}{\\sqrt{n}} \\sum_{i=1}^{n} g(\\X_{i}) \\indist \\N(0, \\E[g(\\X_{i})g(\\X_{i}')]) \n$$ {#eq-clt-mean-zero}\n:::\n\nWe now manipulate our decomposition to arrive at the *stabilized* version of the estimator,\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\right).\n$$\nRecall that we stabilize an estimator to ensure it has a fixed variance as the sample size grows, allowing it to have a non-degenerate asymptotic distribution. The stabilization works by asymptotically centering it (that is, subtracting the value to which it converges) and multiplying by the square root of the sample size. 
We have already established that the first term on the right-hand side will converge in probability to $\\mb{Q}_{\\X\\X}^{-1}$. Notice that $\\E[\\X_{i}e_{i}] = 0$, so we can apply @eq-clt-mean-zero to the second term. The covariance matrix of $\\X_ie_{i}$ is \n$$ \n\\mb{\\Omega} = \\V[\\X_{i}e_{i}] = \\E[\\X_{i}e_{i}(\\X_{i}e_{i})'] = \\E[e_{i}^{2}\\X_{i}\\X_{i}'].\n$$ \nThe CLT will imply that\n$$ \n\\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\indist \\N(0, \\mb{\\Omega}).\n$$\nCombining these facts with Slutsky's Theorem implies the following theorem. \n\n::: {#thm-ols-asymptotic-normality}\n\nSuppose that the linear projection assumptions hold and, in addition, we have $\\E[Y_{i}^{4}] < \\infty$ and $\\E[\\lVert\\X_{i}\\rVert^{4}] < \\infty$. Then the OLS estimator is asymptotically normal with\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) \\indist \\N(0, \\mb{V}_{\\bfbeta}),\n$$\nwhere\n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1}.\n$$\n\n:::\n\nThus, with a large enough sample size we can approximate the distribution of $\\bhat$ with a multivariate normal distribution with mean $\\bfbeta$ and covariance matrix $\\mb{V}_{\\bfbeta}/n$. In particular, the square root of the $j$th diagonals of this matrix will be standard errors for $\\widehat{\\beta}_j$. Knowing the shape of the OLS estimator's multivariate distribution will allow us to conduct hypothesis tests and generate confidence intervals for both individual coefficients and groups of coefficients. But, first, we need an estimate of the covariance matrix.\n\n\n\n## Variance estimation for OLS\n\nThe asymptotic normality of OLS from the last section is of limited value without some way to estimate the covariance matrix, \n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1}.\n$$\nSince each term here is a population mean, this is an ideal place in which to drop a plug-in estimator. For now, we will use the following estimators:\n$$ \n\\begin{aligned}\n \\mb{Q}_{\\X\\X} &= \\E[\\X_{i}\\X_{i}'] & \\widehat{\\mb{Q}}_{\\X\\X} &= \\frac{1}{n} \\sum_{i=1}^{n} \\X_{i}\\X_{i}' = \\frac{1}{n}\\Xmat'\\Xmat \\\\\n \\mb{\\Omega} &= \\E[e_i^2\\X_i\\X_i'] & \\widehat{\\mb{\\Omega}} & = \\frac{1}{n}\\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i'.\n\\end{aligned}\n$$\nUnder the assumptions of @thm-ols-asymptotic-normality, the LLN will imply that these are consistent for the quantities we need, $\\widehat{\\mb{Q}}_{\\X\\X} \\inprob \\mb{Q}_{\\X\\X}$ and $\\widehat{\\mb{\\Omega}} \\inprob \\mb{\\Omega}$. We can plug these into the variance formula to arrive at\n$$ \n\\begin{aligned}\n \\widehat{\\mb{V}}_{\\bfbeta} &= \\widehat{\\mb{Q}}_{\\X\\X}^{-1}\\widehat{\\mb{\\Omega}}\\widehat{\\mb{Q}}_{\\X\\X}^{-1} \\\\\n &= \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n\\end{aligned}\n$$\nwhich by the continuous mapping theorem is consistent, $\\widehat{\\mb{V}}_{\\bfbeta} \\inprob \\mb{V}_{\\bfbeta}$. \n\nThis estimator is sometimes called the **robust variance estimator** or, more accurately, the **heteroskedasticity-consistent (HC) variance estimator**. Why is it robust? 
Consider the standard **homoskedasticity** assumption that most statistical software packages make when estimating OLS variances: the variance of the errors does not depend on the covariates, or $\\V[e_{i}^{2} \\mid \\X_{i}] = \\V[e_{i}^{2}]$. This assumption is stronger than needed, and we can rely on a weaker assumption that the squared errors are uncorrelated with a specific function of the covariates: \n$$ \n\\E[e_{i}^{2}\\X_{i}\\X_{i}'] = \\E[e_{i}^{2}]\\E[\\X_{i}\\X_{i}'] = \\sigma^{2}\\mb{Q}_{\\X\\X}, \n$$\nwhere $\\sigma^2$ is the variance of the residuals (since $\\E[e_{i}] = 0$). Homoskedasticity simplifies the asymptotic variance of the stabilized estimator, $\\sqrt{n}(\\bhat - \\bfbeta)$, to\n$$ \n\\mb{V}^{\\texttt{lm}}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\sigma^{2}\\mb{Q}_{\\X\\X}\\mb{Q}_{\\X\\X}^{-1} = \\sigma^2\\mb{Q}_{\\X\\X}^{-1}.\n$$\nWe already have an estimator for $\\mb{Q}_{\\X\\X}$, but we need one for $\\sigma^2$. We can easily use the SSR,\n$$ \n\\widehat{\\sigma}^{2} = \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\widehat{e}_{i}^{2},\n$$\nwhere we use $n-k-1$ in the denominator instead of $n$ to correct for the residuals being slightly less variable than the actual errors (because OLS mechanically attempts to make the residuals small). For consistent variance estimation, $n-k -1$ or $n$ can be used, since either way $\\widehat{\\sigma}^2 \\inprob \\sigma^2$. Thus, under homoskedasticity, we have\n$$ \n\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}} = \\widehat{\\sigma}^{2}\\left(\\frac{1}{n}\\Xmat'\\Xmat\\right)^{{-1}} = n\\widehat{\\sigma}^{2}\\left(\\Xmat'\\Xmat\\right)^{{-1}},\n$$\nThis is the standard variance estimator used by `lm()` in R and `reg` in Stata. \n\n\nHow do these two estimators, $\\widehat{\\mb{V}}_{\\bfbeta}$ and $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$, compare? Notice that the HC variance estimator and the homoskedasticity variance estimator will both be consistent when homoskedasticity holds. But as the \"heteroskedasticity-consistent\" label implies, only the HC variance estimator will be consistent when homoskedasticity fails to hold. So $\\widehat{\\mb{V}}_{\\bfbeta}$ has the advantage of being consistent regardless of the homoskedasticity assumption. This advantage comes at a cost, however. When homoskedasticity is correct, $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$ incorporates that assumption into the estimator whereas the HC variance estimator has to estimate it. The HC estimator will therefore have higher variance (the variance estimator will be more variable!) when homoskedasticity actually does hold. \n\n\n\n\n\nNow that we have established the asymptotic normality of the OLS estimator and developed a consistent estimator of its variance, we can proceed with all of the statistical inference tools we discussed in Part I, including hypothesis tests and confidence intervals. \n\nWe begin by defining the estimated **heteroskedasticity-consistent standard errors** as\n$$ \n\\widehat{\\se}(\\widehat{\\beta}_{j}) = \\sqrt{\\frac{[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}}{n}},\n$$\nwhere $[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}$ is the $j$th diagonal entry of the HC variance estimator. Note that we divide by $\\sqrt{n}$ here because $\\widehat{\\mb{V}}_{\\bfbeta}$ is a consistent estimator of the stabilized estimator $\\sqrt{n}(\\bhat - \\bfbeta)$ not the estimator itself. \n\nHypothesis tests and confidence intervals for individual coefficients are almost precisely the same as with the most general case presented in Part I. 
For a two-sided test of $H_0: \\beta_j = b$ versus $H_1: \\beta_j \\neq b$, we can build the t-statistic and conclude that, under the null,\n$$\n\\frac{\\widehat{\\beta}_j - b}{\\widehat{\\se}(\\widehat{\\beta}_{j})} \\indist \\N(0, 1).\n$$\nStatistical software will typically and helpfully provide the t-statistic for the null hypothesis of no (partial) linear relationship between $X_{ij}$ and $Y_i$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nwhich measures how large the estimated coefficient is in standard errors. With $\\alpha = 0.05$, asymptotic normality would imply that we reject this null when $t > 1.96$. We can form asymptotically-valid confidence intervals with \n$$ \n\\left[\\widehat{\\beta}_{j} - z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j}),\\;\\widehat{\\beta}_{j} + z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j})\\right]. \n$$\nFor reasons we will discuss below, standard software typically relies on the $t$ distribution instead of the normal for hypothesis testing and confidence intervals. Still, this difference is of little consequence in large samples. \n\n## Inference for multiple parameters\n\nWith multiple coefficients, we might have hypotheses that involve more than one coefficient. As an example, consider a regression with an interaction between two covariates, \n$$\nY_i = \\beta_0 + X_i\\beta_1 + Z_i\\beta_2 + X_iZ_i\\beta_3 + e_i.\n$$\nSuppose we wanted to test the hypothesis that $X_i$ does not affect the best linear predictor for $Y_i$. That would be\n$$ \nH_{0}: \\beta_{1} = 0 \\text{ and } \\beta_{3} = 0\\quad\\text{vs}\\quad H_{1}: \\beta_{1} \\neq 0 \\text{ or } \\beta_{3} \\neq 0,\n$$\nwhere we usually write the null more compactly as $H_0: \\beta_1 = \\beta_3 = 0$. \n\nTo test this null hypothesis, we need a test statistic that discriminates between the two hypotheses: it should be large when the alternative is true and small enough when the null is true. With a single coefficient, we usually test the null hypothesis of $H_0: \\beta_j = b_0$ with the $t$-statistic, \n$$ \nt = \\frac{\\widehat{\\beta}_{j} - b_{0}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nand we usually take the absolute value, $|t|$, as our measure of how extreme our estimate is given the null distribution. But notice that we could also use the square of the $t$ statistic, which is\n$$ \nt^{2} = \\frac{\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{\\V[\\widehat{\\beta}_{j}]} = \\frac{n\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{[\\mb{V}_{\\bfbeta}]_{[jj]}}. \n$$ {#eq-squared-t}\n\nWhile $|t|$ is the usual test statistic we use for two-sided tests, we could equivalently use $t^2$ and arrive at the exact same conclusions (as long as we knew the distribution of $t^2$ under the null hypothesis). It turns out that the $t^2$ version of the test statistic will generalize more easily to comparing multiple coefficients. This version of the test statistic suggests another general way to differentiate the null from the alternative: by taking the squared distance between them and dividing by the variance of the estimate. \n\nCan we generalize this idea to hypotheses about multiple parameters? Adding the sum of squared distances for each component of the null hypothesis is straightforward. 
For our interaction example, that would be\n$$ \n\\widehat{\\beta}_1^2 + \\widehat{\\beta}_3^2, \n$$\nRemember, however, that some of the estimated coefficients are noisier than others, so we should account for the uncertainty just like we did for the $t$-statistic. \n\nWith multiple parameters and multiple coefficients, the variances will now require matrix algebra. We can write any hypothesis about linear functions of the coefficients as $H_{0}: \\mb{L}\\bfbeta = \\mb{c}$. For example, in the interaction case, we have\n$$ \n\\mb{L} =\n\\begin{pmatrix}\n 0 & 1 & 0 & 0 \\\\\n 0 & 0 & 0 & 1 \\\\\n\\end{pmatrix}\n\\qquad\n\\mb{c} =\n\\begin{pmatrix}\n 0 \\\\\n 0\n\\end{pmatrix}\n$$\nThus, $\\mb{L}\\bfbeta = \\mb{0}$ is equivalent to $\\beta_1 = 0$ and $\\beta_3 = 0$. Notice that with other $\\mb{L}$ matrices, we could represent more complicated hypotheses like $2\\beta_1 - \\beta_2 = 34$, though we mostly stick to simpler functions. Let $\\widehat{\\bs{\\theta}} = \\mb{L}\\bhat$ be the OLS estimate of the function of the coefficients. By the delta method (discussed in @sec-delta-method), we have\n$$ \n\\sqrt{n}\\left(\\mb{L}\\bhat - \\mb{L}\\bfbeta\\right) \\indist \\N(0, \\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L}).\n$$\nWe can now generalize the squared $t$ statistic in @eq-squared-t by taking the distances $\\mb{L}\\bhat - \\mb{c}$ weighted by the variance-covariance matrix $\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L}$, \n$$ \nW = n(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L})^{-1}(\\mb{L}\\bhat - \\mb{c}),\n$$\nwhich is called the **Wald test statistic**. This statistic generalizes the ideas of the t-statistic to multiple parameters. With the t-statistic, we recenter to have mean 0 and divide by the standard error to get a variance of 1. If we ignore the middle variance weighting, we have $(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\bhat - \\mb{c})$ which is just the sum of the squared deviations of the estimates from the null. Including the $(\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L})^{-1}$ weight has the effect of rescaling the distribution of $\\mb{L}\\bhat - \\mb{c}$ to make it rotationally symmetric around 0 (so the resulting dimensions are uncorrelated) with each dimension having an equal variance of 1. In this way, the Wald statistic transforms the random vectors to be mean-centered and have variance 1 (just the t-statistic), but also to have the resulting random variables in the vector be uncorrelated.[^norms]\n\n\n[^norms]: The form of the Wald statistic is that of a weighted inner product, $\\mb{x}'\\mb{Ay}$, where $\\mb{A}$ is a symmetric positive-definite weighting matrix. \n\nWhy transform the data in this way? @fig-wald shows the contour plot of a hypothetical joint distribution of two coefficients from an OLS regression. We might want to know the distance between different points in the distribution and the mean, which in this case is $(1, 2)$. Without considering the joint distribution, the circle is obviously closer to the mean than the triangle. However, looking at the two points on the distribution, the circle is at a lower contour than the triangle, meaning it is more extreme than the triangle for this particular distribution. The Wald statistic, then, takes into consideration how much of a \"climb\" it is for $\\mb{L}\\bhat$ to get to $\\mb{c}$ given the distribution of $\\mb{L}\\bhat$.\n\n\n\n\n\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of two slope coefficients. 
The circle is closer to the center of the distribution by the standard Euclidean distance, but the triangle is closer once you consider the joint distribution.](ols_properties_files/figure-html/fig-wald-1.png){#fig-wald width=672}\n:::\n:::\n\n\n\n\n\n\n\n\n\nIf $\\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$ statistic, $W = t^2$. This fact will help us think about the asymptotic distribution of $W$. Note that as $n\\to\\infty$, we know that by the asymptotic normality of $\\bhat$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{\\widehat{\\se}[\\widehat{\\beta}_{j}]} \\indist \\N(0,1)\n$$\nso $t^2$ will converge in distribution to a $\\chi^2_1$ (since a $\\chi^2_1$ distribution is just one standard normal distribution squared). After recentering and rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\\mb{L}\\bhat = \\mb{c}$, we have $W \\indist \\chi^2_{q}$. \n\n\nWe need to define the rejection region to use the Wald statistic in a hypothesis test. Because we are squaring each distance in $W \\geq 0$, larger values of $W$ indicate more disagreement with the null in either direction. Thus, for an $\\alpha$-level test of the joint null, we only need a one-sided rejection region of the form $\\P(W > w_{\\alpha}) = \\alpha$. Obtaining these values is straightforward (see the above callout tip). For $q = 2$ and a $\\alpha = 0.05$, the critical value is roughly 6. \n\n\n\n::: {.callout-note}\n\n## Chi-squared critical values\n\nWe can obtain critical values for the $\\chi^2_q$ distribution using the `qchisq()` function in R. For example, if we wanted to obtain the critical value $w$ such that $\\P(W > w_{\\alpha}) = \\alpha$ for our two-parameter interaction example, we could use:\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqchisq(p = 0.95, df = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 5.991465\n```\n\n\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n\nThe Wald statistic is not a common test provided by standard statistical software functions like `lm()` in R, though it is fairly straightforward to implement \"by hand.\" Alternatively, packages like [`{aod}`](https://cran.r-project.org/web/packages/aod/index.html) or [`{clubSandwich}`](http://jepusto.github.io/clubSandwich/) have implementations of the test. What is reported by most software implementations of OLS (like `lm()` in R) is the F-statistic, which is\n$$ \nF = \\frac{W}{q}.\n$$\nThis also typically uses the homoskedastic variance estimator $\\mb{V}^{\\texttt{lm}}_{\\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution has no justification in statistical theory, but it is slightly more conservative than the $\\chi^2_q$ distribution, and the inferences from the $F$ statistic will converge to those from the $\\chi^2_q$ distribution as $n\\to\\infty$. So it might be justified as an *ad hoc* small-sample adjustment to the Wald test. 
For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and we have, say, a sample size of $n = 100$, then in that case, the critical value for the F test with $\\alpha = 0.05$ is\n\n\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqf(0.95, df1 = 2, df2 = 100 - 4)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.091191\n```\n\n\n:::\n:::\n\n\n\n\n\n\n\n\nThis result implies a critical value of 6.182 on the scale of the Wald statistic (multiplying it by $q = 2$). Compared to the earlier critical value of 5.991 based on the $\\chi^2_2$ distribution, we can see that the inferences will be very similar even in moderately-sized datasets. \n\nFinally, note that the F-statistic reported by `lm()` in R is the test of all the coefficients being equal to 0 jointly except for the intercept. In modern quantitative social sciences, this test is seldom substantively interesting. \n\n\n## Finite-sample properties with a linear CEF\n\nAll the above results have been large-sample properties, and we have not addressed finite-sample properties like the sampling variance or unbiasedness. Under the linear projection assumption above, OLS is generally biased without stronger assumptions. This section introduces the stronger assumption that will allow us to establish stronger properties for OLS. As usual, however, remember that these stronger assumptions can be wrong. \n\n\n::: {.callout-note}\n## Assumption: Linear Regression Model\n1. The variables $(Y_{i}, \\X_{i})$ satisfy the linear CEF assumption.\n$$ \n\\begin{aligned}\n Y_{i} &= \\X_{i}'\\bfbeta + e_{i} \\\\\n \\E[e_{i}\\mid \\X_{i}] & = 0.\n\\end{aligned}\n$$\n\n2. The design matrix is invertible $\\E[\\X_{i}\\X_{i}'] > 0$ (positive definite).\n:::\n\n\nWe discussed the concept of a linear CEF extensively in @sec-regression. However, recall that the CEF might be linear mechanically if the model is **saturated** or when there are as many coefficients in the model as there are unique values of $\\X_i$. When a model is not saturated, the linear CEF assumption is just that: an assumption. What can this assumption do? It can aid in establishing some nice statistical properties in finite samples. \n\nBefore proceeding, note that, when focusing on the finite sample inference for OLS, we focused on its properties **conditional on the observed covariates**, such as $\\E[\\bhat \\mid \\Xmat]$ or $\\V[\\bhat \\mid \\Xmat]$. The historical reason for this is that the researcher often chose these independent variables and so they were not random. Thus, sometimes $\\Xmat$ is treated as \"fixed\" in some older texts, which might even omit explicit conditioning statements. \n\n\n::: {#thm-ols-unbiased}\n\nUnder the linear regression model assumption, OLS is unbiased for the population regression coefficients, \n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta,\n$$\nand its conditional sampling variance is\n$$\n\\mb{\\V}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nwhere $\\sigma^2_{i} = \\E[e_{i}^{2} \\mid \\Xmat]$. 
\n:::\n\n\n::: {.proof}\n\nTo prove the conditional unbiasedness, recall that we can write the OLS estimator as\n$$\n\\bhat = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e},\n$$\nand so taking (conditional) expectations, we have\n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta + \\E[(\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\E[\\mb{e} \\mid \\Xmat] = \\bfbeta,\n$$\nbecause under the linear CEF assumption $\\E[\\mb{e}\\mid \\Xmat] = 0$. \n\nFor the conditional sampling variance, we can use the same decomposition we have,\n$$\n\\V[\\bhat \\mid \\Xmat] = \\V[\\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = (\\Xmat'\\Xmat)^{-1}\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat(\\Xmat'\\Xmat)^{-1}. \n$$\nSince $\\E[\\mb{e}\\mid \\Xmat] = 0$, we know that $\\V[\\mb{e}\\mid \\Xmat] = \\E[\\mb{ee}' \\mid \\Xmat]$, which is a matrix with diagonal entries $\\E[e_{i}^{2} \\mid \\Xmat] = \\sigma^2_i$ and off-diagonal entries $\\E[e_{i}e_{j} \\Xmat] = \\E[e_{i}\\mid \\Xmat]\\E[e_{j}\\mid\\Xmat] = 0$, where the first equality follows from the independence of the errors across units. Thus, $\\V[\\mb{e} \\mid \\Xmat]$ is a diagonal matrix with $\\sigma^2_i$ along the diagonal, which means\n$$\n\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat = \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i',\n$$\nestablishing the conditional sampling variance.\n \n:::\n\nThis means that, for any realization of the covariates, $\\Xmat$, OLS is unbiased for the true regression coefficients $\\bfbeta$. By the law of iterated expectation, we also know that it is unconditionally unbiased[^unconditional] as well since\n$$\n\\E[\\bhat] = \\E[\\E[\\bhat \\mid \\Xmat]] = \\bfbeta. \n$$\nThe difference between these two statements usually isn't incredibly meaningful. \n\n[^unconditional]: We are basically ignoring some edge cases when it comes to discrete covariates here. In particular, we assume that $\\Xmat'\\Xmat$ is nonsingular with probability one. However, this assumption can fail if we have a binary covariate since there is some chance (however slight) that the entire column will be all ones or all zeros, which would lead to a singular matrix $\\Xmat'\\Xmat$. Practically this is not a big deal, but it does mean that we have to ignore this issue theoretically or focus on conditional unbiasedness. \n\n\nThere are a lot of variances flying around, so reviewing them is helpful. Above, we derived the asymptotic variance of $\\mb{Z}_{n} = \\sqrt{n}(\\bhat - \\bfbeta)$, \n$$\n\\mb{V}_{\\bfbeta} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1},\n$$\nwhich implies that the approximate variance of $\\bhat$ will be $\\mb{V}_{\\bfbeta} / n$ because\n$$\n\\bhat = \\frac{Z_n}{\\sqrt{n}} + \\bfbeta \\quad\\implies\\quad \\bhat \\overset{a}{\\sim} \\N(\\bfbeta, n^{-1}\\mb{V}_{\\bfbeta}),\n$$\nwhere $\\overset{a}{\\sim}$ means asymptotically distributed as. Under the linear CEF, the conditional sampling variance of $\\bhat$ has a similar form and will be similar to the \n$$\n\\mb{V}_{\\bhat} = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\approx \\mb{V}_{\\bfbeta} / n.\n$$\nIn practice, these two derivations lead to basically the same variance estimator. 
Recall that the heteroskedastic-consistent variance estimator\n$$\n\\widehat{\\mb{V}}_{\\bfbeta} = \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n$$\nis a valid plug-in estimator for the asymptotic variance and\n$$\n\\widehat{\\mb{V}}_{\\bhat} = n^{-1}\\widehat{\\mb{V}}_{\\bfbeta}.\n$$\nThus, in practice, the asymptotic and finite-sample results under a linear CEF justify the same variance estimator. \n\n\n### Linear CEF model under homoskedasticity\n\nIf we are willing to assume that the standard errors are homoskedastic, we can derive even stronger results for OLS. Stronger assumptions typically lead to stronger conclusions, but, obviously, those conclusions may not be robust to assumption violations. But homoskedasticity of errors is such a historically important assumption that statistical software implementations of OLS like `lm()` in R assume it by default. \n\n::: {.callout-note}\n\n## Assumption: Homoskedasticity with a linear CEF\n\nIn addition to the linear CEF assumption, we further assume that\n$$\n\\E[e_i^2 \\mid \\X_i] = \\E[e_i^2] = \\sigma^2,\n$$\nor that variance of the errors does not depend on the covariates. \n:::\n\n\n::: {#thm-homoskedasticity}\n\nUnder a linear CEF model with homoskedastic errors, the conditional sampling variance is\n$$\n\\mb{V}^{\\texttt{lm}}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\sigma^2 \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nand the variance estimator \n$$\n\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} = \\widehat{\\sigma}^2 \\left( \\Xmat'\\Xmat \\right)^{-1} \\quad\\text{where,}\\quad \\widehat{\\sigma}^2 = \\frac{1}{n - k - 1} \\sum_{i=1}^n \\widehat{e}_i^2\n$$\nis unbiased, $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. \n::: \n\n::: {.proof}\nUnder homoskedasticity $\\sigma^2_i = \\sigma^2$ for all $i$. Recall that $\\sum_{i=1}^n \\X_i\\X_i' = \\Xmat'\\Xmat$. Thus, the conditional sampling variance from @thm-ols-unbiased, \n$$ \n\\begin{aligned}\n\\V[\\bhat \\mid \\Xmat] &= \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2 \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\ &= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\Xmat'\\Xmat \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1} = \\mb{V}^{\\texttt{lm}}_{\\bhat}.\n\\end{aligned}\n$$\n\nFor unbiasedness, we just need to show that $\\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] = \\sigma^2$. Recall that we defined $\\mb{M}_{\\Xmat}$ as the residual-maker because $\\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}}$. We can use this to connect the residuals to the standard errors,\n$$ \n\\mb{M}_{\\Xmat}\\mb{e} = \\mb{M}_{\\Xmat}\\mb{Y} - \\mb{M}_{\\Xmat}\\Xmat\\bfbeta = \\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}},\n$$ \nso \n$$\n\\V[\\widehat{\\mb{e}} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\V[\\mb{e} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\sigma^2,\n$$\nwhere the first equality holds because $\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'$ is constant conditional on $\\Xmat$. 
Notice that the diagonal entries of this matrix are the variances of particular residuals $\\widehat{e}_i$ and that the diagonal entries of the annihilator matrix are $1 - h_{ii}$ (since the $h_{ii}$ are the diagonal entries of $\\mb{P}_{\\Xmat}$). Thus, we have\n$$ \n\\V[\\widehat{e}_i \\mid \\Xmat] = \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] = (1 - h_{ii})\\sigma^{2}.\n$$\nIn the last chapter in @sec-leverage, we established that one property of these leverage values is $\\sum_{i=1}^n h_{ii} = k+ 1$, so $\\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have\n$$ \n\\begin{aligned}\n \\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] &= \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] \\\\\n &= \\frac{\\sigma^{2}}{n-k-1} \\sum_{i=1}^{n} 1 - h_{ii} \\\\\n &= \\sigma^{2}. \n\\end{aligned}\n$$\nThis establishes $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. \n\n:::\n\n\nThus, under the linear CEF model and homoskedasticity of the errors, we have an unbiased variance estimator that is a simple function of the sum of squared residuals and the design matrix. Most statistical software packages estimate standard errors using $\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}$. \n\n\nThe final result we can derive for the linear CEF under the homoskedasticity assumption is an optimality result. That is, we might ask if there is another estimator for $\\bfbeta$ that would outperform OLS in the sense of having a lower sampling variance. Perhaps surprisingly, no linear estimator for $\\bfbeta$ has a lower conditional variance, meaning that OLS is the **best linear unbiased estimator**, often jovially shortened to BLUE. This result is famously known as the Gauss-Markov Theorem.\n\n::: {#thm-gauss-markov}\n\nLet $\\widetilde{\\bfbeta} = \\mb{AY}$ be a linear and unbiased estimator for $\\bfbeta$. Under the linear CEF model with homoskedastic errors, \n$$\n\\V[\\widetilde{\\bfbeta}\\mid \\Xmat] \\geq \\V[\\bhat \\mid \\Xmat]. \n$$\n\n:::\n\n::: {.proof}\nNote that if $\\widetilde{\\bfbeta}$ is unbiased then $\\E[\\widetilde{\\bfbeta} \\mid \\Xmat] = \\bfbeta$ and so \n$$\n\\bfbeta = \\E[\\mb{AY} \\mid \\Xmat] = \\mb{A}\\E[\\mb{Y} \\mid \\Xmat] = \\mb{A}\\Xmat\\bfbeta,\n$$\nwhich implies that $\\mb{A}\\Xmat = \\mb{I}_n$. \nRewrite the competitor as $\\widetilde{\\bfbeta} = \\bhat + \\mb{BY}$ where,\n$$ \n\\mb{B} = \\mb{A} - \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'.\n$$\nand note that $\\mb{A}\\Xmat = \\mb{I}_n$ implies that $\\mb{B}\\Xmat = 0$. 
We now have\n$$ \n\\begin{aligned}\n \\widetilde{\\bfbeta} &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{Y} \\\\\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\mb{B}\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e}\n\\end{aligned}\n$$\nThe variance of the competitor is, thus, \n$$ \n\\begin{aligned}\n \\V[\\widetilde{\\bfbeta} \\mid \\Xmat]\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\V[\\mb{e}\\mid \\Xmat]\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)' \\\\\n &= \\sigma^{2}\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\left( \\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{B}'\\right) \\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\mb{B}' + \\mb{B}\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &\\geq \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1} \\\\\n &= \\V[\\bhat \\mid \\Xmat]\n\\end{aligned}\n$$\nThe first equality comes from the properties of covariance matrices, the second is due to the homoskedasticity assumption, and the fourth is due to $\\mb{B}\\Xmat = 0$, which implies that $\\Xmat'\\mb{B}' = 0$ as well. The fifth inequality holds because matrix products of the form $\\mb{BB}'$ are positive definite if $\\mb{B}$ is of full rank (which we have assumed it is). \n\n:::\n\nIn this proof, we saw that the variance of the competing estimator had variance $\\sigma^2\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)$ which we argued was \"greater than 0\" in the matrix sense, which is also called positive definite. What does this mean practically? Remember that any positive definite matrix must have strictly positive diagonal entries and that the diagonal entries of $\\V[\\bhat \\mid \\Xmat]$ and $V[\\widetilde{\\bfbeta}\\mid \\Xmat]$ are the variances of the individual parameters, $\\V[\\widehat{\\beta}_{j} \\mid \\Xmat]$ and $\\V[\\widetilde{\\beta}_{j} \\mid \\Xmat]$. Thus, the variances of the individual parameters will be larger for $\\widetilde{\\bfbeta}$ than for $\\bhat$.\n\nMany textbooks cite the Gauss-Markov theorem as a critical advantage of OLS over other methods, but recognizing its limitations is essential. It requires linearity and homoskedastic error assumptions, and these can be false in many applications. \n\nFinally, note that while we have shown this result for linear estimators, @Hansen22 proves a more general version of this result that applies to any unbiased estimator. \n\n## The normal linear model\n\nFinally, we add the strongest and thus least loved of the classical linear regression assumption: (conditional) normality of the errors. Historically the reason to use this assumption was that finite-sample inference hits a roadblock without some knowledge of the sampling distribution of $\\bhat$. Under the linear CEF model, we saw that $\\bhat$ is unbiased, and under homoskedasticity, we could produce an unbiased estimator of the conditional variance. 
But for hypothesis testing or for generating confidence intervals, we need to make probability statements about the estimator, and, for that, we need to know its exact distribution. When the sample size is large, we can rely on the CLT and know $\\bhat$ is approximately normal. But how do we proceed in small samples? Historically we would have assumed (conditional) normality of the errors, basically proceeding with some knowledge that we were wrong but hopefully not too wrong. \n\n\n::: {.callout-note}\n\n## The normal linear regression model\n\nIn addition to the linear CEF assumption, we assume that \n$$\ne_i \\mid \\Xmat \\sim \\N(0, \\sigma^2).\n$$\n\n:::\n\nThere are a couple of important points: \n\n- The assumption here is not that $(Y_{i}, \\X_{i})$ are jointly normal (though this would be sufficient for the assumption to hold), but rather that $Y_i$ is normally distributed conditional on $\\X_i$. \n- Notice that the normal regression model has the homoskedasticity assumption baked in. \n\n::: {#thm-normal-ols}\n\nUnder the normal linear regression model, we have\n$$ \n\\begin{aligned}\n \\bhat \\mid \\Xmat &\\sim \\N\\left(\\bfbeta, \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1}\\right) \\\\\n \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}]_{jj}/\\sqrt{n}} &\\sim t_{n-k-1} \\\\\n W/q &\\sim F_{q, n-k-1}. \n\\end{aligned}\n$$\n\n:::\n\n\nThis theorem says that in the normal linear regression model, the coefficients follow a normal distribution, the t-statistics follow a $t$-distribution, and a transformation of the Wald statistic follows an $F$ distribution. These are **exact** results and do not rely on large-sample approximations. Under the assumption of conditional normality of the errors, the results are as valid for $n = 5$ as for $n = 500,000$. \n\nFew people believe errors follow a normal distribution, so why even present these results? Unfortunately, most statistical software implementations of OLS implicitly assume this when calculating p-values for tests or constructing confidence intervals. In R, for example, the p-value associated with the $t$-statistic reported by `lm()` relies on the $t_{n-k-1}$ distribution, and the critical values used to construct confidence intervals with `confint()` use that distribution as well. When normality does not hold, there is no principled reason to use the $t$ or the $F$ distributions in this way. But we might hold our nose and use this *ad hoc* procedure under two rationalizations:\n\n- $\\bhat$ is asymptotically normal. This approximation might, however, be poor in smaller finite samples. The $t$ distribution will make inference more conservative in these cases (wider confidence intervals, smaller test rejection regions), which might help offset its poor approximation of the normal distribution in small samples. \n- As $n\\to\\infty$, the $t_{n-k-1}$ will converge to a standard normal distribution, so the *ad hoc* adjustment will not matter much for medium to large samples. \n\nThese arguments are not very convincing since whether the $t$ approximation will be any better than the normal in finite samples is unclear. But it may be the best we can do while we go and find more data. \n\n## Summary\n\nIn this chapter, we discussed the large-sample properties of OLS, which are quite strong. Under mild conditions, OLS is consistent for the population linear regression coefficients and is asymptotically normal. 
The variance of the OLS estimator, and thus the variance estimator, depends on whether the projection errors are assumed to be unrelated to the covariates (**homoskedastic**) or possibly related (**heteroskedastic**). Confidence intervals and hypothesis tests for individual OLS coefficients are largely the same as discussed in Part I of this book, and we can obtain finite-sample properties of OLS such as conditional unbiasedness if we assume the conditional expectation function is linear. If we further assume the errors are normally distributed, we can derive confidence intervals and hypothesis tests that are valid for all sample sizes. \n", + "markdown": "\n\n# The statistics of least squares {#sec-ols-statistics}\n\nThe last chapter showcased the least squares estimator and investigated many of its more mechanical properties, which are essential for the practical application of OLS. But we still need to understand its statistical properties, as we discussed in Part I of this book: unbiasedness, sampling variance, consistency, and asymptotic normality. As we saw then, these properties fall into finite-sample (unbiasedness, sampling variance) and asymptotic (consistency, asymptotic normality). \n\nIn this chapter, we will focus on the asymptotic properties of OLS because those properties hold under the relatively mild conditions of the linear projection model introduced in @sec-linear-projection. We will see that OLS consistently estimates a coherent quantity of interest (the best linear predictor) regardless of whether the conditional expectation is linear. That is, for the asymptotic properties of the estimator, we will not need the commonly invoked linearity assumption. Later, when we investigate the finite-sample properties, we will show how linearity will help us establish unbiasedness and also how the normality of the errors can allow us to conduct exact, finite-sample inference. But these assumptions are very strong, so understanding what we can say about OLS without them is vital. \n\n## Large-sample properties of OLS\n\nAs we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: (1) a consistent estimate of the variance of $\\bhat$ and (2) the approximate distribution of $\\bhat$ in large samples. Remember that, since $\\bhat$ is a vector, the variance of that estimator will actually be a variance-covariance matrix. To obtain the two key ingredients, we first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which includes its variance. \n\n\nWe begin by setting out the assumptions needed for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\\bfbeta = \\E[\\X_{i}\\X_{i}']^{-1}\\E[\\X_{i}Y_{i}]$, is well-defined and unique. \n\n::: {.callout-note}\n\n### Linear projection assumptions\n\nThe linear projection model makes the following assumptions:\n\n1. $\\{(Y_{i}, \\X_{i})\\}_{i=1}^n$ are iid random vectors\n\n2. $\\E[Y^{2}_{i}] < \\infty$ (finite outcome variance)\n\n3. $\\E[\\Vert \\X_{i}\\Vert^{2}] < \\infty$ (finite variances and covariances of covariates)\n\n2. 
$\\E[\\X_{i}\\X_{i}']$ is positive definite (no linear dependence in the covariates)\n:::\n\n\nRecall that these are mild conditions on the joint distribution of $(Y_{i}, \\X_{i})$ and in particular, we are **not** assuming linearity of the CEF, $\\E[Y_{i} \\mid \\X_{i}]$, nor are we assuming any specific distribution for the data. \n\nWe can helpfully decompose the OLS estimator into the actual BLP coefficient plus estimation error as\n$$ \n\\bhat = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_iY_i \\right) = \\bfbeta + \\underbrace{\\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\right)}_{\\text{estimation error}}.\n$$ \n \nThis decomposition will help us quickly establish the consistency of $\\bhat$. By the law of large numbers, we know that sample means will converge in probability to population expectations, so we have\n$$ \n\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\inprob \\E[\\X_i\\X_i'] \\equiv \\mb{Q}_{\\X\\X} \\qquad \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\inprob \\E[\\X_{i} e_{i}] = \\mb{0},\n$$\nwhich implies by the continuous mapping theorem (the inverse is a continuous function) that \n$$\n\\bhat \\inprob \\bfbeta + \\mb{Q}_{\\X\\X}^{-1}\\E[\\X_ie_i] = \\bfbeta,\n$$\nThe linear projection assumptions ensure that the LLN applies to these sample means and that $\\E[\\X_{i}\\X_{i}']$ is invertible. \n\n\n::: {#thm-ols-consistency}\nUnder the above linear projection assumptions, the OLS estimator is consistent for the best linear projection coefficients, $\\bhat \\inprob \\bfbeta$.\n:::\n\nThus, OLS should be close to the population linear regression in large samples under relatively mild conditions. Remember that this may not equal the conditional expectation if the CEF is nonlinear. What we can say is that OLS converges to the best *linear* approximation to the CEF. Of course, this also means that, if the CEF is linear, then OLS will consistently estimate the coefficients of the CEF. \n\nTo emphasize, the only assumptions made about the dependent variable are that it (1) has finite variance and (2) is iid. Under this assumption, the outcome could be continuous, categorical, binary, or event count. \n\n\nNext, we would like to establish an asymptotic normality result for the OLS coefficients. We first review some key ideas about the Central Limit Theorem.\n\n::: {.callout-note}\n\n## CLT reminder\n\nSuppose that we have a function of the data iid random vectors $\\X_1, \\ldots, \\X_n$, $g(\\X_{i})$ where $\\E[g(\\X_{i})] = 0$ and so $\\V[g(\\X_{i})] = \\E[g(\\X_{i})g(\\X_{i})']$. Then if $\\E[\\Vert g(\\X_{i})\\Vert^{2}] < \\infty$, the CLT implies that\n$$ \n\\sqrt{n}\\left(\\frac{1}{n} \\sum_{i=1}^{n} g(\\X_{i}) - \\E[g(\\X_{i})]\\right) = \\frac{1}{\\sqrt{n}} \\sum_{i=1}^{n} g(\\X_{i}) \\indist \\N(0, \\E[g(\\X_{i})g(\\X_{i}')]) \n$$ {#eq-clt-mean-zero}\n:::\n\nWe now manipulate our decomposition to arrive at the *stabilized* version of the estimator,\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\right).\n$$\nRecall that we stabilize an estimator to ensure it has a fixed variance as the sample size grows, allowing it to have a non-degenerate asymptotic distribution. The stabilization works by asymptotically centering it (that is, subtracting the value to which it converges) and multiplying by the square root of the sample size. 
We have already established that the first term on the right-hand side will converge in probability to $\\mb{Q}_{\\X\\X}^{-1}$. Notice that $\\E[\\X_{i}e_{i}] = 0$, so we can apply @eq-clt-mean-zero to the second term. The covariance matrix of $\\X_ie_{i}$ is \n$$ \n\\mb{\\Omega} = \\V[\\X_{i}e_{i}] = \\E[\\X_{i}e_{i}(\\X_{i}e_{i})'] = \\E[e_{i}^{2}\\X_{i}\\X_{i}'].\n$$ \nThe CLT will imply that\n$$ \n\\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\indist \\N(0, \\mb{\\Omega}).\n$$\nCombining these facts with Slutsky's Theorem implies the following theorem. \n\n::: {#thm-ols-asymptotic-normality}\n\nSuppose that the linear projection assumptions hold and, in addition, we have $\\E[Y_{i}^{4}] < \\infty$ and $\\E[\\lVert\\X_{i}\\rVert^{4}] < \\infty$. Then the OLS estimator is asymptotically normal with\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) \\indist \\N(0, \\mb{V}_{\\bfbeta}),\n$$\nwhere\n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1}.\n$$\n\n:::\n\nThus, with a large enough sample size we can approximate the distribution of $\\bhat$ with a multivariate normal distribution with mean $\\bfbeta$ and covariance matrix $\\mb{V}_{\\bfbeta}/n$. In particular, the square root of the $j$th diagonals of this matrix will be standard errors for $\\widehat{\\beta}_j$. Knowing the shape of the OLS estimator's multivariate distribution will allow us to conduct hypothesis tests and generate confidence intervals for both individual coefficients and groups of coefficients. But, first, we need an estimate of the covariance matrix.\n\n\n\n## Variance estimation for OLS\n\nThe asymptotic normality of OLS from the last section is of limited value without some way to estimate the covariance matrix, \n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1}.\n$$\nSince each term here is a population mean, this is an ideal place in which to drop a plug-in estimator. For now, we will use the following estimators:\n$$ \n\\begin{aligned}\n \\mb{Q}_{\\X\\X} &= \\E[\\X_{i}\\X_{i}'] & \\widehat{\\mb{Q}}_{\\X\\X} &= \\frac{1}{n} \\sum_{i=1}^{n} \\X_{i}\\X_{i}' = \\frac{1}{n}\\Xmat'\\Xmat \\\\\n \\mb{\\Omega} &= \\E[e_i^2\\X_i\\X_i'] & \\widehat{\\mb{\\Omega}} & = \\frac{1}{n}\\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i'.\n\\end{aligned}\n$$\nUnder the assumptions of @thm-ols-asymptotic-normality, the LLN will imply that these are consistent for the quantities we need, $\\widehat{\\mb{Q}}_{\\X\\X} \\inprob \\mb{Q}_{\\X\\X}$ and $\\widehat{\\mb{\\Omega}} \\inprob \\mb{\\Omega}$. We can plug these into the variance formula to arrive at\n$$ \n\\begin{aligned}\n \\widehat{\\mb{V}}_{\\bfbeta} &= \\widehat{\\mb{Q}}_{\\X\\X}^{-1}\\widehat{\\mb{\\Omega}}\\widehat{\\mb{Q}}_{\\X\\X}^{-1} \\\\\n &= \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n\\end{aligned}\n$$\nwhich by the continuous mapping theorem is consistent, $\\widehat{\\mb{V}}_{\\bfbeta} \\inprob \\mb{V}_{\\bfbeta}$. \n\nThis estimator is sometimes called the **robust variance estimator** or, more accurately, the **heteroskedasticity-consistent (HC) variance estimator**. Why is it robust? 
Consider the standard **homoskedasticity** assumption that most statistical software packages make when estimating OLS variances: the variance of the errors does not depend on the covariates, or $\\V[e_{i}^{2} \\mid \\X_{i}] = \\V[e_{i}^{2}]$. This assumption is stronger than needed, and we can rely on a weaker assumption that the squared errors are uncorrelated with a specific function of the covariates: \n$$ \n\\E[e_{i}^{2}\\X_{i}\\X_{i}'] = \\E[e_{i}^{2}]\\E[\\X_{i}\\X_{i}'] = \\sigma^{2}\\mb{Q}_{\\X\\X}, \n$$\nwhere $\\sigma^2$ is the variance of the residuals (since $\\E[e_{i}] = 0$). Homoskedasticity simplifies the asymptotic variance of the stabilized estimator, $\\sqrt{n}(\\bhat - \\bfbeta)$, to\n$$ \n\\mb{V}^{\\texttt{lm}}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\sigma^{2}\\mb{Q}_{\\X\\X}\\mb{Q}_{\\X\\X}^{-1} = \\sigma^2\\mb{Q}_{\\X\\X}^{-1}.\n$$\nWe already have an estimator for $\\mb{Q}_{\\X\\X}$, but we need one for $\\sigma^2$. We can easily use the SSR,\n$$ \n\\widehat{\\sigma}^{2} = \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\widehat{e}_{i}^{2},\n$$\nwhere we use $n-k-1$ in the denominator instead of $n$ to correct for the residuals being slightly less variable than the actual errors (because OLS mechanically attempts to make the residuals small). For consistent variance estimation, $n-k -1$ or $n$ can be used, since either way $\\widehat{\\sigma}^2 \\inprob \\sigma^2$. Thus, under homoskedasticity, we have\n$$ \n\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}} = \\widehat{\\sigma}^{2}\\left(\\frac{1}{n}\\Xmat'\\Xmat\\right)^{{-1}} = n\\widehat{\\sigma}^{2}\\left(\\Xmat'\\Xmat\\right)^{{-1}},\n$$\nThis is the standard variance estimator used by `lm()` in R and `reg` in Stata. \n\n\nHow do these two estimators, $\\widehat{\\mb{V}}_{\\bfbeta}$ and $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$, compare? Notice that the HC variance estimator and the homoskedasticity variance estimator will both be consistent when homoskedasticity holds. But as the \"heteroskedasticity-consistent\" label implies, only the HC variance estimator will be consistent when homoskedasticity fails to hold. So $\\widehat{\\mb{V}}_{\\bfbeta}$ has the advantage of being consistent regardless of the homoskedasticity assumption. This advantage comes at a cost, however. When homoskedasticity is correct, $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$ incorporates that assumption into the estimator whereas the HC variance estimator has to estimate it. The HC estimator will therefore have higher variance (the variance estimator will be more variable!) when homoskedasticity actually does hold. \n\n\n\n\n\nNow that we have established the asymptotic normality of the OLS estimator and developed a consistent estimator of its variance, we can proceed with all of the statistical inference tools we discussed in Part I, including hypothesis tests and confidence intervals. \n\nWe begin by defining the estimated **heteroskedasticity-consistent standard errors** as\n$$ \n\\widehat{\\se}(\\widehat{\\beta}_{j}) = \\sqrt{\\frac{[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}}{n}},\n$$\nwhere $[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}$ is the $j$th diagonal entry of the HC variance estimator. Note that we divide by $\\sqrt{n}$ here because $\\widehat{\\mb{V}}_{\\bfbeta}$ is a consistent estimator of the stabilized estimator $\\sqrt{n}(\\bhat - \\bfbeta)$ not the estimator itself. \n\nHypothesis tests and confidence intervals for individual coefficients are almost precisely the same as with the most general case presented in Part I. 
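\n\nThe heteroskedasticity-consistent standard errors defined above are straightforward to compute directly. The following is a minimal sketch in R, assuming a hypothetical data frame `dat` with outcome `y` and covariates `x` and `z`; the HC0 estimator from the `{sandwich}` package should give the same answer. Because the $1/n$ factors cancel, the code computes $\\widehat{\\mb{V}}_{\\bfbeta}/n$ directly, so the square roots of its diagonal entries are the heteroskedasticity-consistent standard errors.\n\n::: {.cell}\n\n```{.r .cell-code}\n## Minimal sketch: the HC variance estimator by hand\n## (dat, y, x, z are hypothetical placeholders)\nfit <- lm(y ~ x + z, data = dat)\nX <- model.matrix(fit)             # design matrix\ne <- residuals(fit)                # OLS residuals\nbread <- solve(crossprod(X))       # (X'X)^{-1}\nmeat <- crossprod(X * e)           # sum of e_i^2 x_i x_i'\nV_hat <- bread %*% meat %*% bread  # sandwich estimate of V[beta-hat]\nsqrt(diag(V_hat))                  # HC standard errors\n## should agree with sandwich::vcovHC(fit, type = \"HC0\")\n```\n\n:::\n\n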
For a two-sided test of $H_0: \\beta_j = b$ versus $H_1: \\beta_j \\neq b$, we can build the t-statistic and conclude that, under the null,\n$$\n\\frac{\\widehat{\\beta}_j - b}{\\widehat{\\se}(\\widehat{\\beta}_{j})} \\indist \\N(0, 1).\n$$\nStatistical software will typically and helpfully provide the t-statistic for the null hypothesis of no (partial) linear relationship between $X_{ij}$ and $Y_i$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nwhich measures how large the estimated coefficient is in standard errors. With $\\alpha = 0.05$, asymptotic normality would imply that we reject this null when $t > 1.96$. We can form asymptotically-valid confidence intervals with \n$$ \n\\left[\\widehat{\\beta}_{j} - z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j}),\\;\\widehat{\\beta}_{j} + z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j})\\right]. \n$$\nFor reasons we will discuss below, standard software typically relies on the $t$ distribution instead of the normal for hypothesis testing and confidence intervals. Still, this difference is of little consequence in large samples. \n\n## Inference for multiple parameters\n\nWith multiple coefficients, we might have hypotheses that involve more than one coefficient. As an example, consider a regression with an interaction between two covariates, \n$$\nY_i = \\beta_0 + X_i\\beta_1 + Z_i\\beta_2 + X_iZ_i\\beta_3 + e_i.\n$$\nSuppose we wanted to test the hypothesis that $X_i$ does not affect the best linear predictor for $Y_i$. That would be\n$$ \nH_{0}: \\beta_{1} = 0 \\text{ and } \\beta_{3} = 0\\quad\\text{vs}\\quad H_{1}: \\beta_{1} \\neq 0 \\text{ or } \\beta_{3} \\neq 0,\n$$\nwhere we usually write the null more compactly as $H_0: \\beta_1 = \\beta_3 = 0$. \n\nTo test this null hypothesis, we need a test statistic that discriminates between the two hypotheses: it should be large when the alternative is true and small enough when the null is true. With a single coefficient, we usually test the null hypothesis of $H_0: \\beta_j = b_0$ with the $t$-statistic, \n$$ \nt = \\frac{\\widehat{\\beta}_{j} - b_{0}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nand we usually take the absolute value, $|t|$, as our measure of how extreme our estimate is given the null distribution. But notice that we could also use the square of the $t$ statistic, which is\n$$ \nt^{2} = \\frac{\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{\\V[\\widehat{\\beta}_{j}]} = \\frac{n\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{[\\mb{V}_{\\bfbeta}]_{[jj]}}. \n$$ {#eq-squared-t}\n\nWhile $|t|$ is the usual test statistic we use for two-sided tests, we could equivalently use $t^2$ and arrive at the exact same conclusions (as long as we knew the distribution of $t^2$ under the null hypothesis). It turns out that the $t^2$ version of the test statistic will generalize more easily to comparing multiple coefficients. This version of the test statistic suggests another general way to differentiate the null from the alternative: by taking the squared distance between them and dividing by the variance of the estimate. \n\nCan we generalize this idea to hypotheses about multiple parameters? Adding the sum of squared distances for each component of the null hypothesis is straightforward. 
For our interaction example, that would be\n$$ \n\\widehat{\\beta}_1^2 + \\widehat{\\beta}_3^2, \n$$\nRemember, however, that some of the estimated coefficients are noisier than others, so we should account for the uncertainty just like we did for the $t$-statistic. \n\nWith multiple parameters and multiple coefficients, the variances will now require matrix algebra. We can write any hypothesis about linear functions of the coefficients as $H_{0}: \\mb{L}\\bfbeta = \\mb{c}$. For example, in the interaction case, we have\n$$ \n\\mb{L} =\n\\begin{pmatrix}\n 0 & 1 & 0 & 0 \\\\\n 0 & 0 & 0 & 1 \\\\\n\\end{pmatrix}\n\\qquad\n\\mb{c} =\n\\begin{pmatrix}\n 0 \\\\\n 0\n\\end{pmatrix}\n$$\nThus, $\\mb{L}\\bfbeta = \\mb{0}$ is equivalent to $\\beta_1 = 0$ and $\\beta_3 = 0$. Notice that with other $\\mb{L}$ matrices, we could represent more complicated hypotheses like $2\\beta_1 - \\beta_2 = 34$, though we mostly stick to simpler functions. Let $\\widehat{\\bs{\\theta}} = \\mb{L}\\bhat$ be the OLS estimate of the function of the coefficients. By the delta method (discussed in @sec-delta-method), we have\n$$ \n\\sqrt{n}\\left(\\mb{L}\\bhat - \\mb{L}\\bfbeta\\right) \\indist \\N(0, \\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}').\n$$\nWe can now generalize the squared $t$ statistic in @eq-squared-t by taking the distances $\\mb{L}\\bhat - \\mb{c}$ weighted by the variance-covariance matrix $\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}'$, \n$$ \nW = n(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}')^{-1}(\\mb{L}\\bhat - \\mb{c}),\n$$\nwhich is called the **Wald test statistic**. This statistic generalizes the ideas of the t-statistic to multiple parameters. With the t-statistic, we recenter to have mean 0 and divide by the standard error to get a variance of 1. If we ignore the middle variance weighting, we have $(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\bhat - \\mb{c})$ which is just the sum of the squared deviations of the estimates from the null. Including the $(\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}')^{-1}$ weight has the effect of rescaling the distribution of $\\mb{L}\\bhat - \\mb{c}$ to make it rotationally symmetric around 0 (so the resulting dimensions are uncorrelated) with each dimension having an equal variance of 1. In this way, the Wald statistic transforms the random vectors to be mean-centered and have variance 1 (just the t-statistic), but also to have the resulting random variables in the vector be uncorrelated.[^norms]\n\n\n[^norms]: The form of the Wald statistic is that of a weighted inner product, $\\mb{x}'\\mb{Ay}$, where $\\mb{A}$ is a symmetric positive-definite weighting matrix. \n\nWhy transform the data in this way? @fig-wald shows the contour plot of a hypothetical joint distribution of two coefficients from an OLS regression. We might want to know the distance between different points in the distribution and the mean, which in this case is $(1, 2)$. Without considering the joint distribution, the circle is obviously closer to the mean than the triangle. However, looking at the two points on the distribution, the circle is at a lower contour than the triangle, meaning it is more extreme than the triangle for this particular distribution. The Wald statistic, then, takes into consideration how much of a \"climb\" it is for $\\mb{L}\\bhat$ to get to $\\mb{c}$ given the distribution of $\\mb{L}\\bhat$.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of two slope coefficients. 
The circle is closer to the center of the distribution by the standard Euclidean distance, but the triangle is closer once you consider the joint distribution.](ols_properties_files/figure-html/fig-wald-1.png){#fig-wald width=672}\n:::\n:::\n\n\n\nIf $\\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$ statistic, $W = t^2$. This fact will help us think about the asymptotic distribution of $W$. Note that as $n\\to\\infty$, we know that by the asymptotic normality of $\\bhat$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{\\widehat{\\se}[\\widehat{\\beta}_{j}]} \\indist \\N(0,1)\n$$\nso $t^2$ will converge in distribution to a $\\chi^2_1$ (since a $\\chi^2_1$ distribution is just one standard normal distribution squared). After recentering and rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\\mb{L}\\bfbeta = \\mb{c}$, we have $W \\indist \\chi^2_{q}$. \n\n\nWe need to define the rejection region to use the Wald statistic in a hypothesis test. Because we square each distance, $W \\geq 0$, and larger values of $W$ indicate more disagreement with the null in either direction. Thus, for an $\\alpha$-level test of the joint null, we only need a one-sided rejection region of the form $\\P(W > w_{\\alpha}) = \\alpha$. Obtaining these values is straightforward (see the callout note below). For $q = 2$ and an $\\alpha = 0.05$, the critical value is roughly 6. \n\n\n\n::: {.callout-note}\n\n## Chi-squared critical values\n\nWe can obtain critical values for the $\\chi^2_q$ distribution using the `qchisq()` function in R. For example, if we wanted to obtain the critical value $w_{\\alpha}$ such that $\\P(W > w_{\\alpha}) = \\alpha$ for our two-parameter interaction example, we could use:\n\n::: {.cell}\n\n```{.r .cell-code}\nqchisq(p = 0.95, df = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5.991465\n```\n:::\n:::\n\n:::\n\n\nThe Wald statistic is not a common test provided by standard statistical software functions like `lm()` in R, though it is fairly straightforward to implement \"by hand,\" as the sketch below shows. Alternatively, packages like [`{aod}`](https://cran.r-project.org/web/packages/aod/index.html) or [`{clubSandwich}`](http://jepusto.github.io/clubSandwich/) have implementations of the test. What is reported by most software implementations of OLS (like `lm()` in R) is the F-statistic, which is\n$$ \nF = \\frac{W}{q}.\n$$\nThis also typically uses the homoskedastic variance estimator $\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution has no justification in statistical theory, but it is slightly more conservative than the $\\chi^2_q$ distribution, and the inferences from the $F$ statistic will converge to those from the $\\chi^2_q$ distribution as $n\\to\\infty$. So it might be justified as an *ad hoc* small-sample adjustment to the Wald test. 
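\nHere is a minimal \"by hand\" sketch of the Wald test for the interaction example, using simulated data and the HC variance estimator (the data-generating process and object names are illustrative, not from the chapter). Using the estimate of $\\V[\\bhat \\mid \\Xmat]$ directly absorbs the factor of $n$ in the definition of $W$.\n\n::: {.cell}\n\n```{.r .cell-code}\n## simulated interaction regression (illustrative only)\nset.seed(02139)\nn <- 500\nx <- rnorm(n)\nz <- rbinom(n, size = 1, prob = 0.5)\ny <- 0.5 + 0.3 * x - 0.2 * z + 0.4 * x * z + rnorm(n)\n\nfit <- lm(y ~ x * z)           # coefficients: (Intercept), x, z, x:z\nXmat <- model.matrix(fit)\nehat <- residuals(fit)\nbread <- solve(crossprod(Xmat))\nV_hat <- bread %*% crossprod(Xmat * ehat) %*% bread   # HC0 estimate of V[bhat | X]\n\n## H0: beta_1 = beta_3 = 0, i.e., L beta = c with c = (0, 0)\nL <- rbind(c(0, 1, 0, 0),\n           c(0, 0, 0, 1))\ncvec <- c(0, 0)\ndev <- L %*% coef(fit) - cvec\nW <- drop(t(dev) %*% solve(L %*% V_hat %*% t(L)) %*% dev)\n\nc(W = W, crit = qchisq(0.95, df = 2),\n  p_value = pchisq(W, df = 2, lower.tail = FALSE))\n```\n:::\n\nDividing `W` by $q = 2$ gives the corresponding $F$ statistic.\n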
For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and we have, say, a sample size of $n = 100$, then in that case, the critical value for the F test with $\\alpha = 0.05$ is\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqf(0.95, df1 = 2, df2 = 100 - 4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.091191\n```\n:::\n:::\n\n\nThis result implies a critical value of 6.182 on the scale of the Wald statistic (multiplying it by $q = 2$). Compared to the earlier critical value of 5.991 based on the $\\chi^2_2$ distribution, we can see that the inferences will be very similar even in moderately-sized datasets. \n\nFinally, note that the F-statistic reported by `lm()` in R is the test of all the coefficients being equal to 0 jointly except for the intercept. In modern quantitative social sciences, this test is seldom substantively interesting. \n\n\n## Finite-sample properties with a linear CEF\n\nAll the above results have been large-sample properties, and we have not addressed finite-sample properties like the sampling variance or unbiasedness. Under the linear projection assumption above, OLS is generally biased without stronger assumptions. This section introduces the stronger assumption that will allow us to establish stronger properties for OLS. As usual, however, remember that these stronger assumptions can be wrong. \n\n\n::: {.callout-note}\n## Assumption: Linear Regression Model\n1. The variables $(Y_{i}, \\X_{i})$ satisfy the linear CEF assumption.\n$$ \n\\begin{aligned}\n Y_{i} &= \\X_{i}'\\bfbeta + e_{i} \\\\\n \\E[e_{i}\\mid \\X_{i}] & = 0.\n\\end{aligned}\n$$\n\n2. The design matrix is invertible $\\E[\\X_{i}\\X_{i}'] > 0$ (positive definite).\n:::\n\n\nWe discussed the concept of a linear CEF extensively in @sec-regression. However, recall that the CEF might be linear mechanically if the model is **saturated** or when there are as many coefficients in the model as there are unique values of $\\X_i$. When a model is not saturated, the linear CEF assumption is just that: an assumption. What can this assumption do? It can aid in establishing some nice statistical properties in finite samples. \n\nBefore proceeding, note that, when focusing on the finite sample inference for OLS, we focused on its properties **conditional on the observed covariates**, such as $\\E[\\bhat \\mid \\Xmat]$ or $\\V[\\bhat \\mid \\Xmat]$. The historical reason for this is that the researcher often chose these independent variables and so they were not random. Thus, sometimes $\\Xmat$ is treated as \"fixed\" in some older texts, which might even omit explicit conditioning statements. \n\n\n::: {#thm-ols-unbiased}\n\nUnder the linear regression model assumption, OLS is unbiased for the population regression coefficients, \n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta,\n$$\nand its conditional sampling variance is\n$$\n\\mb{\\V}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nwhere $\\sigma^2_{i} = \\E[e_{i}^{2} \\mid \\Xmat]$. 
\n:::\n\n\n::: {.proof}\n\nTo prove the conditional unbiasedness, recall that we can write the OLS estimator as\n$$\n\\bhat = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e},\n$$\nand so taking (conditional) expectations, we have\n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta + \\E[(\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\E[\\mb{e} \\mid \\Xmat] = \\bfbeta,\n$$\nbecause under the linear CEF assumption $\\E[\\mb{e}\\mid \\Xmat] = 0$. \n\nFor the conditional sampling variance, we can use the same decomposition,\n$$\n\\V[\\bhat \\mid \\Xmat] = \\V[\\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = (\\Xmat'\\Xmat)^{-1}\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat(\\Xmat'\\Xmat)^{-1}. \n$$\nSince $\\E[\\mb{e}\\mid \\Xmat] = 0$, we know that $\\V[\\mb{e}\\mid \\Xmat] = \\E[\\mb{ee}' \\mid \\Xmat]$, which is a matrix with diagonal entries $\\E[e_{i}^{2} \\mid \\Xmat] = \\sigma^2_i$ and off-diagonal entries $\\E[e_{i}e_{j} \\mid \\Xmat] = \\E[e_{i}\\mid \\Xmat]\\E[e_{j}\\mid\\Xmat] = 0$, where the first equality follows from the independence of the errors across units. Thus, $\\V[\\mb{e} \\mid \\Xmat]$ is a diagonal matrix with $\\sigma^2_i$ along the diagonal, which means\n$$\n\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat = \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i',\n$$\nestablishing the conditional sampling variance.\n \n:::\n\nThis means that, for any realization of the covariates, $\\Xmat$, OLS is unbiased for the true regression coefficients $\\bfbeta$. By the law of iterated expectation, we also know that it is unconditionally unbiased[^unconditional] as well since\n$$\n\\E[\\bhat] = \\E[\\E[\\bhat \\mid \\Xmat]] = \\bfbeta. \n$$\nThe difference between these two statements is usually of little practical consequence. \n\n[^unconditional]: We are basically ignoring some edge cases when it comes to discrete covariates here. In particular, we assume that $\\Xmat'\\Xmat$ is nonsingular with probability one. However, this assumption can fail if we have a binary covariate since there is some chance (however slight) that the entire column will be all ones or all zeros, which would lead to a singular matrix $\\Xmat'\\Xmat$. Practically this is not a big deal, but it does mean that we have to ignore this issue theoretically or focus on conditional unbiasedness. \n\n\nThere are a lot of variances flying around, so reviewing them is helpful. Above, we derived the asymptotic variance of $\\mb{Z}_{n} = \\sqrt{n}(\\bhat - \\bfbeta)$, \n$$\n\\mb{V}_{\\bfbeta} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1},\n$$\nwhich implies that the approximate variance of $\\bhat$ will be $\\mb{V}_{\\bfbeta} / n$ because\n$$\n\\bhat = \\frac{\\mb{Z}_n}{\\sqrt{n}} + \\bfbeta \\quad\\implies\\quad \\bhat \\overset{a}{\\sim} \\N(\\bfbeta, n^{-1}\\mb{V}_{\\bfbeta}),\n$$\nwhere $\\overset{a}{\\sim}$ means asymptotically distributed as. Under the linear CEF, the conditional sampling variance of $\\bhat$ has a similar form and will be approximately equal to the scaled asymptotic variance,\n$$\n\\mb{V}_{\\bhat} = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\approx \\mb{V}_{\\bfbeta} / n.\n$$\nIn practice, these two derivations lead to basically the same variance estimator, as the short numerical check below illustrates. 
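\nA minimal numerical check of this equivalence, using simulated data (the data-generating process and object names are illustrative): the plug-in estimator of the asymptotic variance, divided by $n$, is exactly the sandwich built directly from $\\Xmat$ and the residuals.\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(60637)\nn <- 200\nXmat <- cbind(1, rnorm(n), rbinom(n, size = 1, prob = 0.4))\ny <- drop(Xmat %*% c(1, 2, -1)) + rnorm(n, sd = 1 + 0.5 * abs(Xmat[, 2]))\n\nbhat <- drop(solve(crossprod(Xmat), crossprod(Xmat, y)))\nehat <- y - drop(Xmat %*% bhat)\n\n## plug-in estimator of the asymptotic variance of sqrt(n) (bhat - beta)\nQhat_inv <- solve(crossprod(Xmat) / n)      # (X'X / n)^{-1}\nOmega_hat <- crossprod(Xmat * ehat) / n     # (1/n) sum ehat_i^2 x_i x_i'\nV_beta_hat <- Qhat_inv %*% Omega_hat %*% Qhat_inv\n\n## finite-sample-style sandwich for the variance of bhat itself\nV_bhat <- solve(crossprod(Xmat)) %*% crossprod(Xmat * ehat) %*% solve(crossprod(Xmat))\n\nall.equal(V_beta_hat / n, V_bhat)   # TRUE up to numerical error\n```\n:::\n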
Recall that the heteroskedastic-consistent variance estimator\n$$\n\\widehat{\\mb{V}}_{\\bfbeta} = \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n$$\nis a valid plug-in estimator for the asymptotic variance and\n$$\n\\widehat{\\mb{V}}_{\\bhat} = n^{-1}\\widehat{\\mb{V}}_{\\bfbeta}.\n$$\nThus, in practice, the asymptotic and finite-sample results under a linear CEF justify the same variance estimator. \n\n\n### Linear CEF model under homoskedasticity\n\nIf we are willing to assume that the standard errors are homoskedastic, we can derive even stronger results for OLS. Stronger assumptions typically lead to stronger conclusions, but, obviously, those conclusions may not be robust to assumption violations. But homoskedasticity of errors is such a historically important assumption that statistical software implementations of OLS like `lm()` in R assume it by default. \n\n::: {.callout-note}\n\n## Assumption: Homoskedasticity with a linear CEF\n\nIn addition to the linear CEF assumption, we further assume that\n$$\n\\E[e_i^2 \\mid \\X_i] = \\E[e_i^2] = \\sigma^2,\n$$\nor that variance of the errors does not depend on the covariates. \n:::\n\n\n::: {#thm-homoskedasticity}\n\nUnder a linear CEF model with homoskedastic errors, the conditional sampling variance is\n$$\n\\mb{V}^{\\texttt{lm}}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\sigma^2 \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nand the variance estimator \n$$\n\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} = \\widehat{\\sigma}^2 \\left( \\Xmat'\\Xmat \\right)^{-1} \\quad\\text{where,}\\quad \\widehat{\\sigma}^2 = \\frac{1}{n - k - 1} \\sum_{i=1}^n \\widehat{e}_i^2\n$$\nis unbiased, $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. \n::: \n\n::: {.proof}\nUnder homoskedasticity $\\sigma^2_i = \\sigma^2$ for all $i$. Recall that $\\sum_{i=1}^n \\X_i\\X_i' = \\Xmat'\\Xmat$. Thus, the conditional sampling variance from @thm-ols-unbiased, \n$$ \n\\begin{aligned}\n\\V[\\bhat \\mid \\Xmat] &= \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2 \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\ &= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\Xmat'\\Xmat \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1} = \\mb{V}^{\\texttt{lm}}_{\\bhat}.\n\\end{aligned}\n$$\n\nFor unbiasedness, we just need to show that $\\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] = \\sigma^2$. Recall that we defined $\\mb{M}_{\\Xmat}$ as the residual-maker because $\\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}}$. We can use this to connect the residuals to the standard errors,\n$$ \n\\mb{M}_{\\Xmat}\\mb{e} = \\mb{M}_{\\Xmat}\\mb{Y} - \\mb{M}_{\\Xmat}\\Xmat\\bfbeta = \\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}},\n$$ \nso \n$$\n\\V[\\widehat{\\mb{e}} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\V[\\mb{e} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\sigma^2,\n$$\nwhere the first equality holds because $\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'$ is constant conditional on $\\Xmat$. 
Notice that the diagonal entries of this matrix are the variances of particular residuals $\\widehat{e}_i$ and that the diagonal entries of the annihilator matrix are $1 - h_{ii}$ (since the $h_{ii}$ are the diagonal entries of $\\mb{P}_{\\Xmat}$). Thus, we have\n$$ \n\\V[\\widehat{e}_i \\mid \\Xmat] = \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] = (1 - h_{ii})\\sigma^{2}.\n$$\nIn the last chapter in @sec-leverage, we established that one property of these leverage values is $\\sum_{i=1}^n h_{ii} = k+ 1$, so $\\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have\n$$ \n\\begin{aligned}\n \\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] &= \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] \\\\\n &= \\frac{\\sigma^{2}}{n-k-1} \\sum_{i=1}^{n} 1 - h_{ii} \\\\\n &= \\sigma^{2}. \n\\end{aligned}\n$$\nThis establishes $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. \n\n:::\n\n\nThus, under the linear CEF model and homoskedasticity of the errors, we have an unbiased variance estimator that is a simple function of the sum of squared residuals and the design matrix. Most statistical software packages estimate standard errors using $\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}$. \n\n\nThe final result we can derive for the linear CEF under the homoskedasticity assumption is an optimality result. That is, we might ask if there is another estimator for $\\bfbeta$ that would outperform OLS in the sense of having a lower sampling variance. Perhaps surprisingly, no linear estimator for $\\bfbeta$ has a lower conditional variance, meaning that OLS is the **best linear unbiased estimator**, often jovially shortened to BLUE. This result is famously known as the Gauss-Markov Theorem.\n\n::: {#thm-gauss-markov}\n\nLet $\\widetilde{\\bfbeta} = \\mb{AY}$ be a linear and unbiased estimator for $\\bfbeta$. Under the linear CEF model with homoskedastic errors, \n$$\n\\V[\\widetilde{\\bfbeta}\\mid \\Xmat] \\geq \\V[\\bhat \\mid \\Xmat]. \n$$\n\n:::\n\n::: {.proof}\nNote that if $\\widetilde{\\bfbeta}$ is unbiased then $\\E[\\widetilde{\\bfbeta} \\mid \\Xmat] = \\bfbeta$ and so \n$$\n\\bfbeta = \\E[\\mb{AY} \\mid \\Xmat] = \\mb{A}\\E[\\mb{Y} \\mid \\Xmat] = \\mb{A}\\Xmat\\bfbeta,\n$$\nwhich implies that $\\mb{A}\\Xmat = \\mb{I}_n$. \nRewrite the competitor as $\\widetilde{\\bfbeta} = \\bhat + \\mb{BY}$ where,\n$$ \n\\mb{B} = \\mb{A} - \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'.\n$$\nand note that $\\mb{A}\\Xmat = \\mb{I}_n$ implies that $\\mb{B}\\Xmat = 0$. 
We now have\n$$ \n\\begin{aligned}\n \\widetilde{\\bfbeta} &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{Y} \\\\\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\mb{B}\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e}\n\\end{aligned}\n$$\nThe variance of the competitor is, thus, \n$$ \n\\begin{aligned}\n \\V[\\widetilde{\\bfbeta} \\mid \\Xmat]\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\V[\\mb{e}\\mid \\Xmat]\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)' \\\\\n &= \\sigma^{2}\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\left( \\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{B}'\\right) \\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\mb{B}' + \\mb{B}\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &\\geq \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1} \\\\\n &= \\V[\\bhat \\mid \\Xmat]\n\\end{aligned}\n$$\nThe first equality comes from the properties of covariance matrices, the second is due to the homoskedasticity assumption, and the fourth is due to $\\mb{B}\\Xmat = 0$, which implies that $\\Xmat'\\mb{B}' = 0$ as well. The fifth inequality holds because matrix products of the form $\\mb{BB}'$ are positive definite if $\\mb{B}$ is of full rank (which we have assumed it is). \n\n:::\n\nIn this proof, we saw that the variance of the competing estimator had variance $\\sigma^2\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)$ which we argued was \"greater than 0\" in the matrix sense, which is also called positive definite. What does this mean practically? Remember that any positive definite matrix must have strictly positive diagonal entries and that the diagonal entries of $\\V[\\bhat \\mid \\Xmat]$ and $V[\\widetilde{\\bfbeta}\\mid \\Xmat]$ are the variances of the individual parameters, $\\V[\\widehat{\\beta}_{j} \\mid \\Xmat]$ and $\\V[\\widetilde{\\beta}_{j} \\mid \\Xmat]$. Thus, the variances of the individual parameters will be larger for $\\widetilde{\\bfbeta}$ than for $\\bhat$.\n\nMany textbooks cite the Gauss-Markov theorem as a critical advantage of OLS over other methods, but recognizing its limitations is essential. It requires linearity and homoskedastic error assumptions, and these can be false in many applications. \n\nFinally, note that while we have shown this result for linear estimators, @Hansen22 proves a more general version of this result that applies to any unbiased estimator. \n\n## The normal linear model\n\nFinally, we add the strongest and thus least loved of the classical linear regression assumption: (conditional) normality of the errors. Historically the reason to use this assumption was that finite-sample inference hits a roadblock without some knowledge of the sampling distribution of $\\bhat$. Under the linear CEF model, we saw that $\\bhat$ is unbiased, and under homoskedasticity, we could produce an unbiased estimator of the conditional variance. 
But for hypothesis testing or for generating confidence intervals, we need to make probability statements about the estimator, and, for that, we need to know its exact distribution. When the sample size is large, we can rely on the CLT and know $\\bhat$ is approximately normal. But how do we proceed in small samples? Historically, we would have assumed (conditional) normality of the errors, basically proceeding with some knowledge that we were wrong but hopefully not too wrong. \n\n\n::: {.callout-note}\n\n## The normal linear regression model\n\nIn addition to the linear CEF assumption, we assume that \n$$\ne_i \\mid \\Xmat \\sim \\N(0, \\sigma^2).\n$$\n\n:::\n\nThere are a couple of important points: \n\n- The assumption here is not that $(Y_{i}, \\X_{i})$ are jointly normal (though this would be sufficient for the assumption to hold), but rather that $Y_i$ is normally distributed conditional on $\\X_i$. \n- Notice that the normal regression model has the homoskedasticity assumption baked in. \n\n::: {#thm-normal-ols}\n\nUnder the normal linear regression model, we have\n$$ \n\\begin{aligned}\n \\bhat \\mid \\Xmat &\\sim \\N\\left(\\bfbeta, \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1}\\right) \\\\\n \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{\\sqrt{[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}]_{jj}}} &\\sim t_{n-k-1} \\\\\n W/q &\\sim F_{q, n-k-1}. \n\\end{aligned}\n$$\n\n:::\n\n\nThis theorem says that in the normal linear regression model, the coefficients follow a normal distribution, the t-statistics follow a $t$-distribution, and a transformation of the Wald statistic follows an $F$ distribution. These are **exact** results and do not rely on large-sample approximations. Under the assumption of conditional normality of the errors, the results are as valid for $n = 5$ as for $n = 500,000$. \n\nFew people believe errors follow a normal distribution, so why even present these results? Unfortunately, most statistical software implementations of OLS implicitly assume this when calculating p-values for tests or constructing confidence intervals. In R, for example, the p-value associated with the $t$-statistic reported by `lm()` relies on the $t_{n-k-1}$ distribution, and the critical values used to construct confidence intervals with `confint()` use that distribution as well. When normality does not hold, there is no principled reason to use the $t$ or the $F$ distributions in this way. But we might hold our nose and use this *ad hoc* procedure under two rationalizations:\n\n- $\\bhat$ is asymptotically normal. This approximation might, however, be poor in smaller finite samples. The $t$ distribution will make inference more conservative in these cases (wider confidence intervals, smaller test rejection regions), which might help offset the poor quality of the normal approximation in small samples. \n- As $n\\to\\infty$, the $t_{n-k-1}$ will converge to a standard normal distribution, so the *ad hoc* adjustment will not matter much for medium to large samples. \n\nThese arguments are not very convincing since whether the $t$ approximation will be any better than the normal in finite samples is unclear. But it may be the best we can do while we go and find more data. \n\n## Summary\n\nIn this chapter, we discussed the large-sample properties of OLS, which are quite strong. Under mild conditions, OLS is consistent for the population linear regression coefficients and is asymptotically normal. 
The variance of the OLS estimator, and thus the variance estimator, depends on whether the projection errors are assumed to be unrelated to the covariates (**homoskedastic**) or possibly related (**heteroskedastic**). Confidence intervals and hypothesis tests for individual OLS coefficients are largely the same as discussed in Part I of this book, and we can obtain finite-sample properties of OLS such as conditional unbiasedness if we assume the conditional expectation function is linear. If we further assume the errors are normally distributed, we can derive confidence intervals and hypothesis tests that are valid for all sample sizes. \n", "supporting": [ "ols_properties_files/figure-html" ], diff --git a/_freeze/ols_properties/execute-results/tex.json b/_freeze/ols_properties/execute-results/tex.json index 5c83d51..2752a47 100644 --- a/_freeze/ols_properties/execute-results/tex.json +++ b/_freeze/ols_properties/execute-results/tex.json @@ -1,8 +1,7 @@ { - "hash": "f6ce8f5973eb6b0086e9e94e9fa87e6a", + "hash": "9969498f97d6e5e726d77c03cffc4a2a", "result": { - "engine": "knitr", - "markdown": "\n\n# The statistics of least squares {#sec-ols-statistics}\n\nThe last chapter showcased the least squares estimator and investigated many of its more mechanical properties, which are essential for the practical application of OLS. But we still need to understand its statistical properties, as we discussed in Part I of this book: unbiasedness, sampling variance, consistency, and asymptotic normality. As we saw then, these properties fall into finite-sample (unbiasedness, sampling variance) and asymptotic (consistency, asymptotic normality). \n\nIn this chapter, we will focus on the asymptotic properties of OLS because those properties hold under the relatively mild conditions of the linear projection model introduced in @sec-linear-projection. We will see that OLS consistently estimates a coherent quantity of interest (the best linear predictor) regardless of whether the conditional expectation is linear. That is, for the asymptotic properties of the estimator, we will not need the commonly invoked linearity assumption. Later, when we investigate the finite-sample properties, we will show how linearity will help us establish unbiasedness and also how the normality of the errors can allow us to conduct exact, finite-sample inference. But these assumptions are very strong, so understanding what we can say about OLS without them is vital. \n\n## Large-sample properties of OLS\n\nAs we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: (1) a consistent estimate of the variance of $\\bhat$ and (2) the approximate distribution of $\\bhat$ in large samples. Remember that, since $\\bhat$ is a vector, the variance of that estimator will actually be a variance-covariance matrix. To obtain the two key ingredients, we first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which includes its variance. \n\n\nWe begin by setting out the assumptions needed for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\\bfbeta = \\E[\\X_{i}\\X_{i}']^{-1}\\E[\\X_{i}Y_{i}]$, is well-defined and unique. \n\n::: {.callout-note}\n\n### Linear projection assumptions\n\nThe linear projection model makes the following assumptions:\n\n1. $\\{(Y_{i}, \\X_{i})\\}_{i=1}^n$ are iid random vectors\n\n2. 
$\\E[Y^{2}_{i}] < \\infty$ (finite outcome variance)\n\n3. $\\E[\\Vert \\X_{i}\\Vert^{2}] < \\infty$ (finite variances and covariances of covariates)\n\n2. $\\E[\\X_{i}\\X_{i}']$ is positive definite (no linear dependence in the covariates)\n:::\n\n\nRecall that these are mild conditions on the joint distribution of $(Y_{i}, \\X_{i})$ and in particular, we are **not** assuming linearity of the CEF, $\\E[Y_{i} \\mid \\X_{i}]$, nor are we assuming any specific distribution for the data. \n\nWe can helpfully decompose the OLS estimator into the actual BLP coefficient plus estimation error as\n$$ \n\\bhat = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_iY_i \\right) = \\bfbeta + \\underbrace{\\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\right)}_{\\text{estimation error}}.\n$$ \n \nThis decomposition will help us quickly establish the consistency of $\\bhat$. By the law of large numbers, we know that sample means will converge in probability to population expectations, so we have\n$$ \n\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\inprob \\E[\\X_i\\X_i'] \\equiv \\mb{Q}_{\\X\\X} \\qquad \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\inprob \\E[\\X_{i} e_{i}] = \\mb{0},\n$$\nwhich implies by the continuous mapping theorem (the inverse is a continuous function) that \n$$\n\\bhat \\inprob \\bfbeta + \\mb{Q}_{\\X\\X}^{-1}\\E[\\X_ie_i] = \\bfbeta,\n$$\nThe linear projection assumptions ensure that the LLN applies to these sample means and that $\\E[\\X_{i}\\X_{i}']$ is invertible. \n\n\n::: {#thm-ols-consistency}\nUnder the above linear projection assumptions, the OLS estimator is consistent for the best linear projection coefficients, $\\bhat \\inprob \\bfbeta$.\n:::\n\nThus, OLS should be close to the population linear regression in large samples under relatively mild conditions. Remember that this may not equal the conditional expectation if the CEF is nonlinear. What we can say is that OLS converges to the best *linear* approximation to the CEF. Of course, this also means that, if the CEF is linear, then OLS will consistently estimate the coefficients of the CEF. \n\nTo emphasize, the only assumptions made about the dependent variable are that it (1) has finite variance and (2) is iid. Under this assumption, the outcome could be continuous, categorical, binary, or event count. \n\n\nNext, we would like to establish an asymptotic normality result for the OLS coefficients. We first review some key ideas about the Central Limit Theorem.\n\n::: {.callout-note}\n\n## CLT reminder\n\nSuppose that we have a function of the data iid random vectors $\\X_1, \\ldots, \\X_n$, $g(\\X_{i})$ where $\\E[g(\\X_{i})] = 0$ and so $\\V[g(\\X_{i})] = \\E[g(\\X_{i})g(\\X_{i})']$. Then if $\\E[\\Vert g(\\X_{i})\\Vert^{2}] < \\infty$, the CLT implies that\n$$ \n\\sqrt{n}\\left(\\frac{1}{n} \\sum_{i=1}^{n} g(\\X_{i}) - \\E[g(\\X_{i})]\\right) = \\frac{1}{\\sqrt{n}} \\sum_{i=1}^{n} g(\\X_{i}) \\indist \\N(0, \\E[g(\\X_{i})g(\\X_{i}')]) \n$$ {#eq-clt-mean-zero}\n:::\n\nWe now manipulate our decomposition to arrive at the *stabilized* version of the estimator,\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\right).\n$$\nRecall that we stabilize an estimator to ensure it has a fixed variance as the sample size grows, allowing it to have a non-degenerate asymptotic distribution. 
The stabilization works by asymptotically centering it (that is, subtracting the value to which it converges) and multiplying by the square root of the sample size. We have already established that the first term on the right-hand side will converge in probability to $\\mb{Q}_{\\X\\X}^{-1}$. Notice that $\\E[\\X_{i}e_{i}] = 0$, so we can apply @eq-clt-mean-zero to the second term. The covariance matrix of $\\X_ie_{i}$ is \n$$ \n\\mb{\\Omega} = \\V[\\X_{i}e_{i}] = \\E[\\X_{i}e_{i}(\\X_{i}e_{i})'] = \\E[e_{i}^{2}\\X_{i}\\X_{i}'].\n$$ \nThe CLT will imply that\n$$ \n\\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\indist \\N(0, \\mb{\\Omega}).\n$$\nCombining these facts with Slutsky's Theorem implies the following theorem. \n\n::: {#thm-ols-asymptotic-normality}\n\nSuppose that the linear projection assumptions hold and, in addition, we have $\\E[Y_{i}^{4}] < \\infty$ and $\\E[\\lVert\\X_{i}\\rVert^{4}] < \\infty$. Then the OLS estimator is asymptotically normal with\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) \\indist \\N(0, \\mb{V}_{\\bfbeta}),\n$$\nwhere\n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1}.\n$$\n\n:::\n\nThus, with a large enough sample size we can approximate the distribution of $\\bhat$ with a multivariate normal distribution with mean $\\bfbeta$ and covariance matrix $\\mb{V}_{\\bfbeta}/n$. In particular, the square root of the $j$th diagonals of this matrix will be standard errors for $\\widehat{\\beta}_j$. Knowing the shape of the OLS estimator's multivariate distribution will allow us to conduct hypothesis tests and generate confidence intervals for both individual coefficients and groups of coefficients. But, first, we need an estimate of the covariance matrix.\n\n\n\n## Variance estimation for OLS\n\nThe asymptotic normality of OLS from the last section is of limited value without some way to estimate the covariance matrix, \n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1}.\n$$\nSince each term here is a population mean, this is an ideal place in which to drop a plug-in estimator. For now, we will use the following estimators:\n$$ \n\\begin{aligned}\n \\mb{Q}_{\\X\\X} &= \\E[\\X_{i}\\X_{i}'] & \\widehat{\\mb{Q}}_{\\X\\X} &= \\frac{1}{n} \\sum_{i=1}^{n} \\X_{i}\\X_{i}' = \\frac{1}{n}\\Xmat'\\Xmat \\\\\n \\mb{\\Omega} &= \\E[e_i^2\\X_i\\X_i'] & \\widehat{\\mb{\\Omega}} & = \\frac{1}{n}\\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i'.\n\\end{aligned}\n$$\nUnder the assumptions of @thm-ols-asymptotic-normality, the LLN will imply that these are consistent for the quantities we need, $\\widehat{\\mb{Q}}_{\\X\\X} \\inprob \\mb{Q}_{\\X\\X}$ and $\\widehat{\\mb{\\Omega}} \\inprob \\mb{\\Omega}$. We can plug these into the variance formula to arrive at\n$$ \n\\begin{aligned}\n \\widehat{\\mb{V}}_{\\bfbeta} &= \\widehat{\\mb{Q}}_{\\X\\X}^{-1}\\widehat{\\mb{\\Omega}}\\widehat{\\mb{Q}}_{\\X\\X}^{-1} \\\\\n &= \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n\\end{aligned}\n$$\nwhich by the continuous mapping theorem is consistent, $\\widehat{\\mb{V}}_{\\bfbeta} \\inprob \\mb{V}_{\\bfbeta}$. \n\nThis estimator is sometimes called the **robust variance estimator** or, more accurately, the **heteroskedasticity-consistent (HC) variance estimator**. Why is it robust? 
Consider the standard **homoskedasticity** assumption that most statistical software packages make when estimating OLS variances: the variance of the errors does not depend on the covariates, or $\\V[e_{i}^{2} \\mid \\X_{i}] = \\V[e_{i}^{2}]$. This assumption is stronger than needed, and we can rely on a weaker assumption that the squared errors are uncorrelated with a specific function of the covariates: \n$$ \n\\E[e_{i}^{2}\\X_{i}\\X_{i}'] = \\E[e_{i}^{2}]\\E[\\X_{i}\\X_{i}'] = \\sigma^{2}\\mb{Q}_{\\X\\X}, \n$$\nwhere $\\sigma^2$ is the variance of the residuals (since $\\E[e_{i}] = 0$). Homoskedasticity simplifies the asymptotic variance of the stabilized estimator, $\\sqrt{n}(\\bhat - \\bfbeta)$, to\n$$ \n\\mb{V}^{\\texttt{lm}}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\sigma^{2}\\mb{Q}_{\\X\\X}\\mb{Q}_{\\X\\X}^{-1} = \\sigma^2\\mb{Q}_{\\X\\X}^{-1}.\n$$\nWe already have an estimator for $\\mb{Q}_{\\X\\X}$, but we need one for $\\sigma^2$. We can easily use the SSR,\n$$ \n\\widehat{\\sigma}^{2} = \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\widehat{e}_{i}^{2},\n$$\nwhere we use $n-k-1$ in the denominator instead of $n$ to correct for the residuals being slightly less variable than the actual errors (because OLS mechanically attempts to make the residuals small). For consistent variance estimation, $n-k -1$ or $n$ can be used, since either way $\\widehat{\\sigma}^2 \\inprob \\sigma^2$. Thus, under homoskedasticity, we have\n$$ \n\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}} = \\widehat{\\sigma}^{2}\\left(\\frac{1}{n}\\Xmat'\\Xmat\\right)^{{-1}} = n\\widehat{\\sigma}^{2}\\left(\\Xmat'\\Xmat\\right)^{{-1}},\n$$\nThis is the standard variance estimator used by `lm()` in R and `reg` in Stata. \n\n\nHow do these two estimators, $\\widehat{\\mb{V}}_{\\bfbeta}$ and $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$, compare? Notice that the HC variance estimator and the homoskedasticity variance estimator will both be consistent when homoskedasticity holds. But as the \"heteroskedasticity-consistent\" label implies, only the HC variance estimator will be consistent when homoskedasticity fails to hold. So $\\widehat{\\mb{V}}_{\\bfbeta}$ has the advantage of being consistent regardless of the homoskedasticity assumption. This advantage comes at a cost, however. When homoskedasticity is correct, $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$ incorporates that assumption into the estimator whereas the HC variance estimator has to estimate it. The HC estimator will therefore have higher variance (the variance estimator will be more variable!) when homoskedasticity actually does hold. \n\n\n\n\n\nNow that we have established the asymptotic normality of the OLS estimator and developed a consistent estimator of its variance, we can proceed with all of the statistical inference tools we discussed in Part I, including hypothesis tests and confidence intervals. \n\nWe begin by defining the estimated **heteroskedasticity-consistent standard errors** as\n$$ \n\\widehat{\\se}(\\widehat{\\beta}_{j}) = \\sqrt{\\frac{[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}}{n}},\n$$\nwhere $[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}$ is the $j$th diagonal entry of the HC variance estimator. Note that we divide by $\\sqrt{n}$ here because $\\widehat{\\mb{V}}_{\\bfbeta}$ is a consistent estimator of the stabilized estimator $\\sqrt{n}(\\bhat - \\bfbeta)$ not the estimator itself. \n\nHypothesis tests and confidence intervals for individual coefficients are almost precisely the same as with the most general case presented in Part I. 
For a two-sided test of $H_0: \\beta_j = b$ versus $H_1: \\beta_j \\neq b$, we can build the t-statistic and conclude that, under the null,\n$$\n\\frac{\\widehat{\\beta}_j - b}{\\widehat{\\se}(\\widehat{\\beta}_{j})} \\indist \\N(0, 1).\n$$\nStatistical software will typically and helpfully provide the t-statistic for the null hypothesis of no (partial) linear relationship between $X_{ij}$ and $Y_i$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nwhich measures how large the estimated coefficient is in standard errors. With $\\alpha = 0.05$, asymptotic normality would imply that we reject this null when $t > 1.96$. We can form asymptotically-valid confidence intervals with \n$$ \n\\left[\\widehat{\\beta}_{j} - z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j}),\\;\\widehat{\\beta}_{j} + z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j})\\right]. \n$$\nFor reasons we will discuss below, standard software typically relies on the $t$ distribution instead of the normal for hypothesis testing and confidence intervals. Still, this difference is of little consequence in large samples. \n\n## Inference for multiple parameters\n\nWith multiple coefficients, we might have hypotheses that involve more than one coefficient. As an example, consider a regression with an interaction between two covariates, \n$$\nY_i = \\beta_0 + X_i\\beta_1 + Z_i\\beta_2 + X_iZ_i\\beta_3 + e_i.\n$$\nSuppose we wanted to test the hypothesis that $X_i$ does not affect the best linear predictor for $Y_i$. That would be\n$$ \nH_{0}: \\beta_{1} = 0 \\text{ and } \\beta_{3} = 0\\quad\\text{vs}\\quad H_{1}: \\beta_{1} \\neq 0 \\text{ or } \\beta_{3} \\neq 0,\n$$\nwhere we usually write the null more compactly as $H_0: \\beta_1 = \\beta_3 = 0$. \n\nTo test this null hypothesis, we need a test statistic that discriminates between the two hypotheses: it should be large when the alternative is true and small enough when the null is true. With a single coefficient, we usually test the null hypothesis of $H_0: \\beta_j = b_0$ with the $t$-statistic, \n$$ \nt = \\frac{\\widehat{\\beta}_{j} - b_{0}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nand we usually take the absolute value, $|t|$, as our measure of how extreme our estimate is given the null distribution. But notice that we could also use the square of the $t$ statistic, which is\n$$ \nt^{2} = \\frac{\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{\\V[\\widehat{\\beta}_{j}]} = \\frac{n\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{[\\mb{V}_{\\bfbeta}]_{[jj]}}. \n$$ {#eq-squared-t}\n\nWhile $|t|$ is the usual test statistic we use for two-sided tests, we could equivalently use $t^2$ and arrive at the exact same conclusions (as long as we knew the distribution of $t^2$ under the null hypothesis). It turns out that the $t^2$ version of the test statistic will generalize more easily to comparing multiple coefficients. This version of the test statistic suggests another general way to differentiate the null from the alternative: by taking the squared distance between them and dividing by the variance of the estimate. \n\nCan we generalize this idea to hypotheses about multiple parameters? Adding the sum of squared distances for each component of the null hypothesis is straightforward. 
For our interaction example, that would be\n$$ \n\\widehat{\\beta}_1^2 + \\widehat{\\beta}_3^2, \n$$\nRemember, however, that some of the estimated coefficients are noisier than others, so we should account for the uncertainty just like we did for the $t$-statistic. \n\nWith multiple parameters and multiple coefficients, the variances will now require matrix algebra. We can write any hypothesis about linear functions of the coefficients as $H_{0}: \\mb{L}\\bfbeta = \\mb{c}$. For example, in the interaction case, we have\n$$ \n\\mb{L} =\n\\begin{pmatrix}\n 0 & 1 & 0 & 0 \\\\\n 0 & 0 & 0 & 1 \\\\\n\\end{pmatrix}\n\\qquad\n\\mb{c} =\n\\begin{pmatrix}\n 0 \\\\\n 0\n\\end{pmatrix}\n$$\nThus, $\\mb{L}\\bfbeta = \\mb{0}$ is equivalent to $\\beta_1 = 0$ and $\\beta_3 = 0$. Notice that with other $\\mb{L}$ matrices, we could represent more complicated hypotheses like $2\\beta_1 - \\beta_2 = 34$, though we mostly stick to simpler functions. Let $\\widehat{\\bs{\\theta}} = \\mb{L}\\bhat$ be the OLS estimate of the function of the coefficients. By the delta method (discussed in @sec-delta-method), we have\n$$ \n\\sqrt{n}\\left(\\mb{L}\\bhat - \\mb{L}\\bfbeta\\right) \\indist \\N(0, \\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L}).\n$$\nWe can now generalize the squared $t$ statistic in @eq-squared-t by taking the distances $\\mb{L}\\bhat - \\mb{c}$ weighted by the variance-covariance matrix $\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L}$, \n$$ \nW = n(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L})^{-1}(\\mb{L}\\bhat - \\mb{c}),\n$$\nwhich is called the **Wald test statistic**. This statistic generalizes the ideas of the t-statistic to multiple parameters. With the t-statistic, we recenter to have mean 0 and divide by the standard error to get a variance of 1. If we ignore the middle variance weighting, we have $(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\bhat - \\mb{c})$ which is just the sum of the squared deviations of the estimates from the null. Including the $(\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L})^{-1}$ weight has the effect of rescaling the distribution of $\\mb{L}\\bhat - \\mb{c}$ to make it rotationally symmetric around 0 (so the resulting dimensions are uncorrelated) with each dimension having an equal variance of 1. In this way, the Wald statistic transforms the random vectors to be mean-centered and have variance 1 (just the t-statistic), but also to have the resulting random variables in the vector be uncorrelated.[^norms]\n\n\n[^norms]: The form of the Wald statistic is that of a weighted inner product, $\\mb{x}'\\mb{Ay}$, where $\\mb{A}$ is a symmetric positive-definite weighting matrix. \n\nWhy transform the data in this way? @fig-wald shows the contour plot of a hypothetical joint distribution of two coefficients from an OLS regression. We might want to know the distance between different points in the distribution and the mean, which in this case is $(1, 2)$. Without considering the joint distribution, the circle is obviously closer to the mean than the triangle. However, looking at the two points on the distribution, the circle is at a lower contour than the triangle, meaning it is more extreme than the triangle for this particular distribution. The Wald statistic, then, takes into consideration how much of a \"climb\" it is for $\\mb{L}\\bhat$ to get to $\\mb{c}$ given the distribution of $\\mb{L}\\bhat$.\n\n\n\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of two slope coefficients. 
The circle is closer to the center of the distribution by the standard Euclidean distance, but the triangle is closer once you consider the joint distribution.](ols_properties_files/figure-pdf/fig-wald-1.pdf){#fig-wald}\n:::\n:::\n\n\n\n\n\n\n\nIf $\\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$ statistic, $W = t^2$. This fact will help us think about the asymptotic distribution of $W$. Note that as $n\\to\\infty$, we know that by the asymptotic normality of $\\bhat$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{\\widehat{\\se}[\\widehat{\\beta}_{j}]} \\indist \\N(0,1)\n$$\nso $t^2$ will converge in distribution to a $\\chi^2_1$ (since a $\\chi^2_1$ distribution is just one standard normal distribution squared). After recentering and rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\\mb{L}\\bhat = \\mb{c}$, we have $W \\indist \\chi^2_{q}$. \n\n\nWe need to define the rejection region to use the Wald statistic in a hypothesis test. Because we are squaring each distance in $W \\geq 0$, larger values of $W$ indicate more disagreement with the null in either direction. Thus, for an $\\alpha$-level test of the joint null, we only need a one-sided rejection region of the form $\\P(W > w_{\\alpha}) = \\alpha$. Obtaining these values is straightforward (see the above callout tip). For $q = 2$ and a $\\alpha = 0.05$, the critical value is roughly 6. \n\n\n\n::: {.callout-note}\n\n## Chi-squared critical values\n\nWe can obtain critical values for the $\\chi^2_q$ distribution using the `qchisq()` function in R. For example, if we wanted to obtain the critical value $w$ such that $\\P(W > w_{\\alpha}) = \\alpha$ for our two-parameter interaction example, we could use:\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqchisq(p = 0.95, df = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 5.991465\n```\n\n\n:::\n:::\n\n\n\n\n\n:::\n\n\nThe Wald statistic is not a common test provided by standard statistical software functions like `lm()` in R, though it is fairly straightforward to implement \"by hand.\" Alternatively, packages like [`{aod}`](https://cran.r-project.org/web/packages/aod/index.html) or [`{clubSandwich}`](http://jepusto.github.io/clubSandwich/) have implementations of the test. What is reported by most software implementations of OLS (like `lm()` in R) is the F-statistic, which is\n$$ \nF = \\frac{W}{q}.\n$$\nThis also typically uses the homoskedastic variance estimator $\\mb{V}^{\\texttt{lm}}_{\\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution has no justification in statistical theory, but it is slightly more conservative than the $\\chi^2_q$ distribution, and the inferences from the $F$ statistic will converge to those from the $\\chi^2_q$ distribution as $n\\to\\infty$. So it might be justified as an *ad hoc* small-sample adjustment to the Wald test. 
For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and we have, say, a sample size of $n = 100$, then in that case, the critical value for the F test with $\\alpha = 0.05$ is\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqf(0.95, df1 = 2, df2 = 100 - 4)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.091191\n```\n\n\n:::\n:::\n\n\n\n\n\n\nThis result implies a critical value of 6.182 on the scale of the Wald statistic (multiplying it by $q = 2$). Compared to the earlier critical value of 5.991 based on the $\\chi^2_2$ distribution, we can see that the inferences will be very similar even in moderately-sized datasets. \n\nFinally, note that the F-statistic reported by `lm()` in R is the test of all the coefficients being equal to 0 jointly except for the intercept. In modern quantitative social sciences, this test is seldom substantively interesting. \n\n\n## Finite-sample properties with a linear CEF\n\nAll the above results have been large-sample properties, and we have not addressed finite-sample properties like the sampling variance or unbiasedness. Under the linear projection assumption above, OLS is generally biased without stronger assumptions. This section introduces the stronger assumption that will allow us to establish stronger properties for OLS. As usual, however, remember that these stronger assumptions can be wrong. \n\n\n::: {.callout-note}\n## Assumption: Linear Regression Model\n1. The variables $(Y_{i}, \\X_{i})$ satisfy the linear CEF assumption.\n$$ \n\\begin{aligned}\n Y_{i} &= \\X_{i}'\\bfbeta + e_{i} \\\\\n \\E[e_{i}\\mid \\X_{i}] & = 0.\n\\end{aligned}\n$$\n\n2. The design matrix is invertible $\\E[\\X_{i}\\X_{i}'] > 0$ (positive definite).\n:::\n\n\nWe discussed the concept of a linear CEF extensively in @sec-regression. However, recall that the CEF might be linear mechanically if the model is **saturated** or when there are as many coefficients in the model as there are unique values of $\\X_i$. When a model is not saturated, the linear CEF assumption is just that: an assumption. What can this assumption do? It can aid in establishing some nice statistical properties in finite samples. \n\nBefore proceeding, note that, when focusing on the finite sample inference for OLS, we focused on its properties **conditional on the observed covariates**, such as $\\E[\\bhat \\mid \\Xmat]$ or $\\V[\\bhat \\mid \\Xmat]$. The historical reason for this is that the researcher often chose these independent variables and so they were not random. Thus, sometimes $\\Xmat$ is treated as \"fixed\" in some older texts, which might even omit explicit conditioning statements. \n\n\n::: {#thm-ols-unbiased}\n\nUnder the linear regression model assumption, OLS is unbiased for the population regression coefficients, \n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta,\n$$\nand its conditional sampling variance is\n$$\n\\mb{\\V}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nwhere $\\sigma^2_{i} = \\E[e_{i}^{2} \\mid \\Xmat]$. 
\n:::\n\n\n::: {.proof}\n\nTo prove the conditional unbiasedness, recall that we can write the OLS estimator as\n$$\n\\bhat = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e},\n$$\nand so taking (conditional) expectations, we have\n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta + \\E[(\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\E[\\mb{e} \\mid \\Xmat] = \\bfbeta,\n$$\nbecause under the linear CEF assumption $\\E[\\mb{e}\\mid \\Xmat] = 0$. \n\nFor the conditional sampling variance, we can use the same decomposition we have,\n$$\n\\V[\\bhat \\mid \\Xmat] = \\V[\\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = (\\Xmat'\\Xmat)^{-1}\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat(\\Xmat'\\Xmat)^{-1}. \n$$\nSince $\\E[\\mb{e}\\mid \\Xmat] = 0$, we know that $\\V[\\mb{e}\\mid \\Xmat] = \\E[\\mb{ee}' \\mid \\Xmat]$, which is a matrix with diagonal entries $\\E[e_{i}^{2} \\mid \\Xmat] = \\sigma^2_i$ and off-diagonal entries $\\E[e_{i}e_{j} \\Xmat] = \\E[e_{i}\\mid \\Xmat]\\E[e_{j}\\mid\\Xmat] = 0$, where the first equality follows from the independence of the errors across units. Thus, $\\V[\\mb{e} \\mid \\Xmat]$ is a diagonal matrix with $\\sigma^2_i$ along the diagonal, which means\n$$\n\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat = \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i',\n$$\nestablishing the conditional sampling variance.\n \n:::\n\nThis means that, for any realization of the covariates, $\\Xmat$, OLS is unbiased for the true regression coefficients $\\bfbeta$. By the law of iterated expectation, we also know that it is unconditionally unbiased[^unconditional] as well since\n$$\n\\E[\\bhat] = \\E[\\E[\\bhat \\mid \\Xmat]] = \\bfbeta. \n$$\nThe difference between these two statements usually isn't incredibly meaningful. \n\n[^unconditional]: We are basically ignoring some edge cases when it comes to discrete covariates here. In particular, we assume that $\\Xmat'\\Xmat$ is nonsingular with probability one. However, this assumption can fail if we have a binary covariate since there is some chance (however slight) that the entire column will be all ones or all zeros, which would lead to a singular matrix $\\Xmat'\\Xmat$. Practically this is not a big deal, but it does mean that we have to ignore this issue theoretically or focus on conditional unbiasedness. \n\n\nThere are a lot of variances flying around, so reviewing them is helpful. Above, we derived the asymptotic variance of $\\mb{Z}_{n} = \\sqrt{n}(\\bhat - \\bfbeta)$, \n$$\n\\mb{V}_{\\bfbeta} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1},\n$$\nwhich implies that the approximate variance of $\\bhat$ will be $\\mb{V}_{\\bfbeta} / n$ because\n$$\n\\bhat = \\frac{Z_n}{\\sqrt{n}} + \\bfbeta \\quad\\implies\\quad \\bhat \\overset{a}{\\sim} \\N(\\bfbeta, n^{-1}\\mb{V}_{\\bfbeta}),\n$$\nwhere $\\overset{a}{\\sim}$ means asymptotically distributed as. Under the linear CEF, the conditional sampling variance of $\\bhat$ has a similar form and will be similar to the \n$$\n\\mb{V}_{\\bhat} = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\approx \\mb{V}_{\\bfbeta} / n.\n$$\nIn practice, these two derivations lead to basically the same variance estimator. 
Recall that the heteroskedastic-consistent variance estimator\n$$\n\\widehat{\\mb{V}}_{\\bfbeta} = \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n$$\nis a valid plug-in estimator for the asymptotic variance and\n$$\n\\widehat{\\mb{V}}_{\\bhat} = n^{-1}\\widehat{\\mb{V}}_{\\bfbeta}.\n$$\nThus, in practice, the asymptotic and finite-sample results under a linear CEF justify the same variance estimator. \n\n\n### Linear CEF model under homoskedasticity\n\nIf we are willing to assume that the standard errors are homoskedastic, we can derive even stronger results for OLS. Stronger assumptions typically lead to stronger conclusions, but, obviously, those conclusions may not be robust to assumption violations. But homoskedasticity of errors is such a historically important assumption that statistical software implementations of OLS like `lm()` in R assume it by default. \n\n::: {.callout-note}\n\n## Assumption: Homoskedasticity with a linear CEF\n\nIn addition to the linear CEF assumption, we further assume that\n$$\n\\E[e_i^2 \\mid \\X_i] = \\E[e_i^2] = \\sigma^2,\n$$\nor that variance of the errors does not depend on the covariates. \n:::\n\n\n::: {#thm-homoskedasticity}\n\nUnder a linear CEF model with homoskedastic errors, the conditional sampling variance is\n$$\n\\mb{V}^{\\texttt{lm}}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\sigma^2 \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nand the variance estimator \n$$\n\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} = \\widehat{\\sigma}^2 \\left( \\Xmat'\\Xmat \\right)^{-1} \\quad\\text{where,}\\quad \\widehat{\\sigma}^2 = \\frac{1}{n - k - 1} \\sum_{i=1}^n \\widehat{e}_i^2\n$$\nis unbiased, $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. \n::: \n\n::: {.proof}\nUnder homoskedasticity $\\sigma^2_i = \\sigma^2$ for all $i$. Recall that $\\sum_{i=1}^n \\X_i\\X_i' = \\Xmat'\\Xmat$. Thus, the conditional sampling variance from @thm-ols-unbiased, \n$$ \n\\begin{aligned}\n\\V[\\bhat \\mid \\Xmat] &= \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2 \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\ &= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\Xmat'\\Xmat \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1} = \\mb{V}^{\\texttt{lm}}_{\\bhat}.\n\\end{aligned}\n$$\n\nFor unbiasedness, we just need to show that $\\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] = \\sigma^2$. Recall that we defined $\\mb{M}_{\\Xmat}$ as the residual-maker because $\\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}}$. We can use this to connect the residuals to the standard errors,\n$$ \n\\mb{M}_{\\Xmat}\\mb{e} = \\mb{M}_{\\Xmat}\\mb{Y} - \\mb{M}_{\\Xmat}\\Xmat\\bfbeta = \\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}},\n$$ \nso \n$$\n\\V[\\widehat{\\mb{e}} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\V[\\mb{e} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\sigma^2,\n$$\nwhere the first equality holds because $\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'$ is constant conditional on $\\Xmat$. 
Notice that the diagonal entries of this matrix are the variances of particular residuals $\\widehat{e}_i$ and that the diagonal entries of the annihilator matrix are $1 - h_{ii}$ (since the $h_{ii}$ are the diagonal entries of $\\mb{P}_{\\Xmat}$). Thus, we have\n$$ \n\\V[\\widehat{e}_i \\mid \\Xmat] = \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] = (1 - h_{ii})\\sigma^{2}.\n$$\nIn the last chapter in @sec-leverage, we established that one property of these leverage values is $\\sum_{i=1}^n h_{ii} = k+ 1$, so $\\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have\n$$ \n\\begin{aligned}\n \\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] &= \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] \\\\\n &= \\frac{\\sigma^{2}}{n-k-1} \\sum_{i=1}^{n} 1 - h_{ii} \\\\\n &= \\sigma^{2}. \n\\end{aligned}\n$$\nThis establishes $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. \n\n:::\n\n\nThus, under the linear CEF model and homoskedasticity of the errors, we have an unbiased variance estimator that is a simple function of the sum of squared residuals and the design matrix. Most statistical software packages estimate standard errors using $\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}$. \n\n\nThe final result we can derive for the linear CEF under the homoskedasticity assumption is an optimality result. That is, we might ask if there is another estimator for $\\bfbeta$ that would outperform OLS in the sense of having a lower sampling variance. Perhaps surprisingly, no linear estimator for $\\bfbeta$ has a lower conditional variance, meaning that OLS is the **best linear unbiased estimator**, often jovially shortened to BLUE. This result is famously known as the Gauss-Markov Theorem.\n\n::: {#thm-gauss-markov}\n\nLet $\\widetilde{\\bfbeta} = \\mb{AY}$ be a linear and unbiased estimator for $\\bfbeta$. Under the linear CEF model with homoskedastic errors, \n$$\n\\V[\\widetilde{\\bfbeta}\\mid \\Xmat] \\geq \\V[\\bhat \\mid \\Xmat]. \n$$\n\n:::\n\n::: {.proof}\nNote that if $\\widetilde{\\bfbeta}$ is unbiased then $\\E[\\widetilde{\\bfbeta} \\mid \\Xmat] = \\bfbeta$ and so \n$$\n\\bfbeta = \\E[\\mb{AY} \\mid \\Xmat] = \\mb{A}\\E[\\mb{Y} \\mid \\Xmat] = \\mb{A}\\Xmat\\bfbeta,\n$$\nwhich implies that $\\mb{A}\\Xmat = \\mb{I}_n$. \nRewrite the competitor as $\\widetilde{\\bfbeta} = \\bhat + \\mb{BY}$ where,\n$$ \n\\mb{B} = \\mb{A} - \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'.\n$$\nand note that $\\mb{A}\\Xmat = \\mb{I}_n$ implies that $\\mb{B}\\Xmat = 0$. 
We now have\n$$ \n\\begin{aligned}\n \\widetilde{\\bfbeta} &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{Y} \\\\\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\mb{B}\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e}\n\\end{aligned}\n$$\nThe variance of the competitor is, thus, \n$$ \n\\begin{aligned}\n \\V[\\widetilde{\\bfbeta} \\mid \\Xmat]\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\V[\\mb{e}\\mid \\Xmat]\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)' \\\\\n &= \\sigma^{2}\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\left( \\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{B}'\\right) \\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\mb{B}' + \\mb{B}\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &\\geq \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1} \\\\\n &= \\V[\\bhat \\mid \\Xmat]\n\\end{aligned}\n$$\nThe first equality comes from the properties of covariance matrices, the second is due to the homoskedasticity assumption, and the fourth is due to $\\mb{B}\\Xmat = 0$, which implies that $\\Xmat'\\mb{B}' = 0$ as well. The fifth inequality holds because matrix products of the form $\\mb{BB}'$ are positive definite if $\\mb{B}$ is of full rank (which we have assumed it is). \n\n:::\n\nIn this proof, we saw that the variance of the competing estimator had variance $\\sigma^2\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)$ which we argued was \"greater than 0\" in the matrix sense, which is also called positive definite. What does this mean practically? Remember that any positive definite matrix must have strictly positive diagonal entries and that the diagonal entries of $\\V[\\bhat \\mid \\Xmat]$ and $V[\\widetilde{\\bfbeta}\\mid \\Xmat]$ are the variances of the individual parameters, $\\V[\\widehat{\\beta}_{j} \\mid \\Xmat]$ and $\\V[\\widetilde{\\beta}_{j} \\mid \\Xmat]$. Thus, the variances of the individual parameters will be larger for $\\widetilde{\\bfbeta}$ than for $\\bhat$.\n\nMany textbooks cite the Gauss-Markov theorem as a critical advantage of OLS over other methods, but recognizing its limitations is essential. It requires linearity and homoskedastic error assumptions, and these can be false in many applications. \n\nFinally, note that while we have shown this result for linear estimators, @Hansen22 proves a more general version of this result that applies to any unbiased estimator. \n\n## The normal linear model\n\nFinally, we add the strongest and thus least loved of the classical linear regression assumption: (conditional) normality of the errors. Historically the reason to use this assumption was that finite-sample inference hits a roadblock without some knowledge of the sampling distribution of $\\bhat$. Under the linear CEF model, we saw that $\\bhat$ is unbiased, and under homoskedasticity, we could produce an unbiased estimator of the conditional variance. 
But for hypothesis testing or for generating confidence intervals, we need to make probability statements about the estimator, and, for that, we need to know its exact distribution. When the sample size is large, we can rely on the CLT and know $\\bhat$ is approximately normal. But how do we proceed in small samples? Historically we would have assumed (conditional) normality of the errors, basically proceeding with some knowledge that we were wrong but hopefully not too wrong. \n\n\n::: {.callout-note}\n\n## The normal linear regression model\n\nIn addition to the linear CEF assumption, we assume that \n$$\ne_i \\mid \\Xmat \\sim \\N(0, \\sigma^2).\n$$\n\n:::\n\nThere are a couple of important points: \n\n- The assumption here is not that $(Y_{i}, \\X_{i})$ are jointly normal (though this would be sufficient for the assumption to hold), but rather that $Y_i$ is normally distributed conditional on $\\X_i$. \n- Notice that the normal regression model has the homoskedasticity assumption baked in. \n\n::: {#thm-normal-ols}\n\nUnder the normal linear regression model, we have\n$$ \n\\begin{aligned}\n \\bhat \\mid \\Xmat &\\sim \\N\\left(\\bfbeta, \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1}\\right) \\\\\n \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}]_{jj}/\\sqrt{n}} &\\sim t_{n-k-1} \\\\\n W/q &\\sim F_{q, n-k-1}. \n\\end{aligned}\n$$\n\n:::\n\n\nThis theorem says that in the normal linear regression model, the coefficients follow a normal distribution, the t-statistics follow a $t$-distribution, and a transformation of the Wald statistic follows an $F$ distribution. These are **exact** results and do not rely on large-sample approximations. Under the assumption of conditional normality of the errors, the results are as valid for $n = 5$ as for $n = 500,000$. \n\nFew people believe errors follow a normal distribution, so why even present these results? Unfortunately, most statistical software implementations of OLS implicitly assume this when calculating p-values for tests or constructing confidence intervals. In R, for example, the p-value associated with the $t$-statistic reported by `lm()` relies on the $t_{n-k-1}$ distribution, and the critical values used to construct confidence intervals with `confint()` use that distribution as well. When normality does not hold, there is no principled reason to use the $t$ or the $F$ distributions in this way. But we might hold our nose and use this *ad hoc* procedure under two rationalizations:\n\n- $\\bhat$ is asymptotically normal. This approximation might, however, be poor in smaller finite samples. The $t$ distribution will make inference more conservative in these cases (wider confidence intervals, smaller test rejection regions), which might help offset its poor approximation of the normal distribution in small samples. \n- As $n\\to\\infty$, the $t_{n-k-1}$ will converge to a standard normal distribution, so the *ad hoc* adjustment will not matter much for medium to large samples. \n\nThese arguments are not very convincing since whether the $t$ approximation will be any better than the normal in finite samples is unclear. But it may be the best we can do while we go and find more data. \n\n## Summary\n\nIn this chapter, we discussed the large-sample properties of OLS, which are quite strong. Under mild conditions, OLS is consistent for the population linear regression coefficients and is asymptotically normal. 
The variance of the OLS estimator, and thus the variance estimator, depends on whether the projection errors are assumed to be unrelated to the covariates (**homoskedastic**) or possibly related (**heteroskedastic**). Confidence intervals and hypothesis tests for individual OLS coefficients are largely the same as discussed in Part I of this book, and we can obtain finite-sample properties of OLS such as conditional unbiasedness if we assume the conditional expectation function is linear. If we further assume the errors are normally distributed, we can derive confidence intervals and hypothesis tests that are valid for all sample sizes. \n", + "markdown": "\n\n# The statistics of least squares {#sec-ols-statistics}\n\nThe last chapter showcased the least squares estimator and investigated many of its more mechanical properties, which are essential for the practical application of OLS. But we still need to understand its statistical properties, as we discussed in Part I of this book: unbiasedness, sampling variance, consistency, and asymptotic normality. As we saw then, these properties fall into finite-sample (unbiasedness, sampling variance) and asymptotic (consistency, asymptotic normality). \n\nIn this chapter, we will focus on the asymptotic properties of OLS because those properties hold under the relatively mild conditions of the linear projection model introduced in @sec-linear-projection. We will see that OLS consistently estimates a coherent quantity of interest (the best linear predictor) regardless of whether the conditional expectation is linear. That is, for the asymptotic properties of the estimator, we will not need the commonly invoked linearity assumption. Later, when we investigate the finite-sample properties, we will show how linearity will help us establish unbiasedness and also how the normality of the errors can allow us to conduct exact, finite-sample inference. But these assumptions are very strong, so understanding what we can say about OLS without them is vital. \n\n## Large-sample properties of OLS\n\nAs we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: (1) a consistent estimate of the variance of $\\bhat$ and (2) the approximate distribution of $\\bhat$ in large samples. Remember that, since $\\bhat$ is a vector, the variance of that estimator will actually be a variance-covariance matrix. To obtain the two key ingredients, we first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which includes its variance. \n\n\nWe begin by setting out the assumptions needed for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\\bfbeta = \\E[\\X_{i}\\X_{i}']^{-1}\\E[\\X_{i}Y_{i}]$, is well-defined and unique. \n\n::: {.callout-note}\n\n### Linear projection assumptions\n\nThe linear projection model makes the following assumptions:\n\n1. $\\{(Y_{i}, \\X_{i})\\}_{i=1}^n$ are iid random vectors\n\n2. $\\E[Y^{2}_{i}] < \\infty$ (finite outcome variance)\n\n3. $\\E[\\Vert \\X_{i}\\Vert^{2}] < \\infty$ (finite variances and covariances of covariates)\n\n2. 
$\\E[\\X_{i}\\X_{i}']$ is positive definite (no linear dependence in the covariates)\n:::\n\n\nRecall that these are mild conditions on the joint distribution of $(Y_{i}, \\X_{i})$ and in particular, we are **not** assuming linearity of the CEF, $\\E[Y_{i} \\mid \\X_{i}]$, nor are we assuming any specific distribution for the data. \n\nWe can helpfully decompose the OLS estimator into the actual BLP coefficient plus estimation error as\n$$ \n\\bhat = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_iY_i \\right) = \\bfbeta + \\underbrace{\\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\right)}_{\\text{estimation error}}.\n$$ \n \nThis decomposition will help us quickly establish the consistency of $\\bhat$. By the law of large numbers, we know that sample means will converge in probability to population expectations, so we have\n$$ \n\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\inprob \\E[\\X_i\\X_i'] \\equiv \\mb{Q}_{\\X\\X} \\qquad \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\inprob \\E[\\X_{i} e_{i}] = \\mb{0},\n$$\nwhich implies by the continuous mapping theorem (the inverse is a continuous function) that \n$$\n\\bhat \\inprob \\bfbeta + \\mb{Q}_{\\X\\X}^{-1}\\E[\\X_ie_i] = \\bfbeta,\n$$\nThe linear projection assumptions ensure that the LLN applies to these sample means and that $\\E[\\X_{i}\\X_{i}']$ is invertible. \n\n\n::: {#thm-ols-consistency}\nUnder the above linear projection assumptions, the OLS estimator is consistent for the best linear projection coefficients, $\\bhat \\inprob \\bfbeta$.\n:::\n\nThus, OLS should be close to the population linear regression in large samples under relatively mild conditions. Remember that this may not equal the conditional expectation if the CEF is nonlinear. What we can say is that OLS converges to the best *linear* approximation to the CEF. Of course, this also means that, if the CEF is linear, then OLS will consistently estimate the coefficients of the CEF. \n\nTo emphasize, the only assumptions made about the dependent variable are that it (1) has finite variance and (2) is iid. Under this assumption, the outcome could be continuous, categorical, binary, or event count. \n\n\nNext, we would like to establish an asymptotic normality result for the OLS coefficients. We first review some key ideas about the Central Limit Theorem.\n\n::: {.callout-note}\n\n## CLT reminder\n\nSuppose that we have a function of the data iid random vectors $\\X_1, \\ldots, \\X_n$, $g(\\X_{i})$ where $\\E[g(\\X_{i})] = 0$ and so $\\V[g(\\X_{i})] = \\E[g(\\X_{i})g(\\X_{i})']$. Then if $\\E[\\Vert g(\\X_{i})\\Vert^{2}] < \\infty$, the CLT implies that\n$$ \n\\sqrt{n}\\left(\\frac{1}{n} \\sum_{i=1}^{n} g(\\X_{i}) - \\E[g(\\X_{i})]\\right) = \\frac{1}{\\sqrt{n}} \\sum_{i=1}^{n} g(\\X_{i}) \\indist \\N(0, \\E[g(\\X_{i})g(\\X_{i}')]) \n$$ {#eq-clt-mean-zero}\n:::\n\nWe now manipulate our decomposition to arrive at the *stabilized* version of the estimator,\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\right).\n$$\nRecall that we stabilize an estimator to ensure it has a fixed variance as the sample size grows, allowing it to have a non-degenerate asymptotic distribution. The stabilization works by asymptotically centering it (that is, subtracting the value to which it converges) and multiplying by the square root of the sample size. 
We have already established that the first term on the right-hand side will converge in probability to $\\mb{Q}_{\\X\\X}^{-1}$. Notice that $\\E[\\X_{i}e_{i}] = 0$, so we can apply @eq-clt-mean-zero to the second term. The covariance matrix of $\\X_ie_{i}$ is \n$$ \n\\mb{\\Omega} = \\V[\\X_{i}e_{i}] = \\E[\\X_{i}e_{i}(\\X_{i}e_{i})'] = \\E[e_{i}^{2}\\X_{i}\\X_{i}'].\n$$ \nThe CLT will imply that\n$$ \n\\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\indist \\N(0, \\mb{\\Omega}).\n$$\nCombining these facts with Slutsky's Theorem implies the following theorem. \n\n::: {#thm-ols-asymptotic-normality}\n\nSuppose that the linear projection assumptions hold and, in addition, we have $\\E[Y_{i}^{4}] < \\infty$ and $\\E[\\lVert\\X_{i}\\rVert^{4}] < \\infty$. Then the OLS estimator is asymptotically normal with\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) \\indist \\N(0, \\mb{V}_{\\bfbeta}),\n$$\nwhere\n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1}.\n$$\n\n:::\n\nThus, with a large enough sample size we can approximate the distribution of $\\bhat$ with a multivariate normal distribution with mean $\\bfbeta$ and covariance matrix $\\mb{V}_{\\bfbeta}/n$. In particular, the square root of the $j$th diagonals of this matrix will be standard errors for $\\widehat{\\beta}_j$. Knowing the shape of the OLS estimator's multivariate distribution will allow us to conduct hypothesis tests and generate confidence intervals for both individual coefficients and groups of coefficients. But, first, we need an estimate of the covariance matrix.\n\n\n\n## Variance estimation for OLS\n\nThe asymptotic normality of OLS from the last section is of limited value without some way to estimate the covariance matrix, \n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1}.\n$$\nSince each term here is a population mean, this is an ideal place in which to drop a plug-in estimator. For now, we will use the following estimators:\n$$ \n\\begin{aligned}\n \\mb{Q}_{\\X\\X} &= \\E[\\X_{i}\\X_{i}'] & \\widehat{\\mb{Q}}_{\\X\\X} &= \\frac{1}{n} \\sum_{i=1}^{n} \\X_{i}\\X_{i}' = \\frac{1}{n}\\Xmat'\\Xmat \\\\\n \\mb{\\Omega} &= \\E[e_i^2\\X_i\\X_i'] & \\widehat{\\mb{\\Omega}} & = \\frac{1}{n}\\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i'.\n\\end{aligned}\n$$\nUnder the assumptions of @thm-ols-asymptotic-normality, the LLN will imply that these are consistent for the quantities we need, $\\widehat{\\mb{Q}}_{\\X\\X} \\inprob \\mb{Q}_{\\X\\X}$ and $\\widehat{\\mb{\\Omega}} \\inprob \\mb{\\Omega}$. We can plug these into the variance formula to arrive at\n$$ \n\\begin{aligned}\n \\widehat{\\mb{V}}_{\\bfbeta} &= \\widehat{\\mb{Q}}_{\\X\\X}^{-1}\\widehat{\\mb{\\Omega}}\\widehat{\\mb{Q}}_{\\X\\X}^{-1} \\\\\n &= \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n\\end{aligned}\n$$\nwhich by the continuous mapping theorem is consistent, $\\widehat{\\mb{V}}_{\\bfbeta} \\inprob \\mb{V}_{\\bfbeta}$. \n\nThis estimator is sometimes called the **robust variance estimator** or, more accurately, the **heteroskedasticity-consistent (HC) variance estimator**. Why is it robust? 
Consider the standard **homoskedasticity** assumption that most statistical software packages make when estimating OLS variances: the variance of the errors does not depend on the covariates, or $\E[e_{i}^{2} \mid \X_{i}] = \E[e_{i}^{2}]$. This assumption is stronger than needed, and we can rely on a weaker assumption that the squared errors are uncorrelated with a specific function of the covariates: \n$$ \n\E[e_{i}^{2}\X_{i}\X_{i}'] = \E[e_{i}^{2}]\E[\X_{i}\X_{i}'] = \sigma^{2}\mb{Q}_{\X\X}, \n$$\nwhere $\sigma^2$ is the variance of the errors (since $\E[e_{i}] = 0$). Homoskedasticity simplifies the asymptotic variance of the stabilized estimator, $\sqrt{n}(\bhat - \bfbeta)$, to\n$$ \n\mb{V}^{\texttt{lm}}_{\bfbeta} = \mb{Q}_{\X\X}^{-1}\sigma^{2}\mb{Q}_{\X\X}\mb{Q}_{\X\X}^{-1} = \sigma^2\mb{Q}_{\X\X}^{-1}.\n$$\nWe already have an estimator for $\mb{Q}_{\X\X}$, but we need one for $\sigma^2$. We can easily use the SSR,\n$$ \n\widehat{\sigma}^{2} = \frac{1}{n-k-1} \sum_{i=1}^{n} \widehat{e}_{i}^{2},\n$$\nwhere we use $n-k-1$ in the denominator instead of $n$ to correct for the residuals being slightly less variable than the actual errors (because OLS mechanically attempts to make the residuals small). For consistent variance estimation, $n-k -1$ or $n$ can be used, since either way $\widehat{\sigma}^2 \inprob \sigma^2$. Thus, under homoskedasticity, we have\n$$ \n\widehat{\mb{V}}_{\bfbeta}^{\texttt{lm}} = \widehat{\sigma}^{2}\left(\frac{1}{n}\Xmat'\Xmat\right)^{{-1}} = n\widehat{\sigma}^{2}\left(\Xmat'\Xmat\right)^{{-1}}.\n$$\nDividing by $n$ gives the standard variance estimator used by `lm()` in R and `reg` in Stata. \n\n\nHow do these two estimators, $\widehat{\mb{V}}_{\bfbeta}$ and $\widehat{\mb{V}}_{\bfbeta}^{\texttt{lm}}$, compare? Notice that the HC variance estimator and the homoskedasticity variance estimator will both be consistent when homoskedasticity holds. But as the \"heteroskedasticity-consistent\" label implies, only the HC variance estimator will be consistent when homoskedasticity fails to hold. So $\widehat{\mb{V}}_{\bfbeta}$ has the advantage of being consistent regardless of the homoskedasticity assumption. This advantage comes at a cost, however. When homoskedasticity is correct, $\widehat{\mb{V}}_{\bfbeta}^{\texttt{lm}}$ incorporates that assumption into the estimator whereas the HC variance estimator has to estimate it. The HC estimator will therefore have higher variance (the variance estimator will be more variable!) when homoskedasticity actually does hold. \n\n\n\n\n\nNow that we have established the asymptotic normality of the OLS estimator and developed a consistent estimator of its variance, we can proceed with all of the statistical inference tools we discussed in Part I, including hypothesis tests and confidence intervals. \n\nWe begin by defining the estimated **heteroskedasticity-consistent standard errors** as\n$$ \n\widehat{\se}(\widehat{\beta}_{j}) = \sqrt{\frac{[\widehat{\mb{V}}_{\bfbeta}]_{jj}}{n}},\n$$\nwhere $[\widehat{\mb{V}}_{\bfbeta}]_{jj}$ is the $j$th diagonal entry of the HC variance estimator. Note that we divide by $\sqrt{n}$ here because $\widehat{\mb{V}}_{\bfbeta}$ is a consistent estimator of the variance of the stabilized quantity $\sqrt{n}(\bhat - \bfbeta)$, not of the variance of $\bhat$ itself. \n\nHypothesis tests and confidence intervals for individual coefficients are almost precisely the same as with the most general case presented in Part I. 
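\n\n::: {.callout-tip}\n\n## HC standard errors by hand\n\nAs a concrete illustration, the following sketch (simulated data; the variable names, coefficient values, and sample size are our own illustrative choices) computes $\widehat{\mb{V}}_{\bfbeta}/n$ and the HC standard errors directly from the plug-in formulas above and compares them to the default standard errors reported by `lm()`. If the `{sandwich}` package is available, the same matrix should match `sandwich::vcovHC(fit, type = 'HC0')`.\n\n```r\nset.seed(42)\nn <- 500\nx <- rnorm(n)\nz <- rbinom(n, 1, 0.5)\ny <- 1 + 0.5 * x - 0.25 * z + rnorm(n) * (1 + abs(x))  # heteroskedastic errors\nfit <- lm(y ~ x + z)\nX <- model.matrix(fit)            # design matrix\nehat <- resid(fit)                # residuals\nXtX_inv <- solve(crossprod(X))    # (X'X)^{-1}\nmeat <- crossprod(X * ehat)       # sum of ehat_i^2 * x_i x_i'\nV_hc <- XtX_inv %*% meat %*% XtX_inv   # estimate of the variance of bhat (V_beta / n)\nse_hc <- sqrt(diag(V_hc))         # HC standard errors\ncbind(classical = sqrt(diag(vcov(fit))), hc = se_hc)\n```\n\n:::\n\n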
For a two-sided test of $H_0: \beta_j = b$ versus $H_1: \beta_j \neq b$, we can build the t-statistic and conclude that, under the null,\n$$\n\frac{\widehat{\beta}_j - b}{\widehat{\se}(\widehat{\beta}_{j})} \indist \N(0, 1).\n$$\nStatistical software will typically and helpfully provide the t-statistic for the null hypothesis of no (partial) linear relationship between $X_{ij}$ and $Y_i$,\n$$ \nt = \frac{\widehat{\beta}_{j}}{\widehat{\se}(\widehat{\beta}_{j})},\n$$\nwhich measures how large the estimated coefficient is in standard errors. With $\alpha = 0.05$, asymptotic normality would imply that we reject this null when $|t| > 1.96$. We can form asymptotically-valid confidence intervals with \n$$ \n\left[\widehat{\beta}_{j} - z_{\alpha/2}\;\widehat{\se}(\widehat{\beta}_{j}),\;\widehat{\beta}_{j} + z_{\alpha/2}\;\widehat{\se}(\widehat{\beta}_{j})\right]. \n$$\nFor reasons we will discuss below, standard software typically relies on the $t$ distribution instead of the normal for hypothesis testing and confidence intervals. Still, this difference is of little consequence in large samples. \n\n## Inference for multiple parameters\n\nWith multiple coefficients, we might have hypotheses that involve more than one coefficient. As an example, consider a regression with an interaction between two covariates, \n$$\nY_i = \beta_0 + X_i\beta_1 + Z_i\beta_2 + X_iZ_i\beta_3 + e_i.\n$$\nSuppose we wanted to test the hypothesis that $X_i$ does not affect the best linear predictor for $Y_i$. That would be\n$$ \nH_{0}: \beta_{1} = 0 \text{ and } \beta_{3} = 0\quad\text{vs}\quad H_{1}: \beta_{1} \neq 0 \text{ or } \beta_{3} \neq 0,\n$$\nwhere we usually write the null more compactly as $H_0: \beta_1 = \beta_3 = 0$. \n\nTo test this null hypothesis, we need a test statistic that discriminates between the two hypotheses: it should be large when the alternative is true and small enough when the null is true. With a single coefficient, we usually test the null hypothesis of $H_0: \beta_j = b_0$ with the $t$-statistic, \n$$ \nt = \frac{\widehat{\beta}_{j} - b_{0}}{\widehat{\se}(\widehat{\beta}_{j})},\n$$\nand we usually take the absolute value, $|t|$, as our measure of how extreme our estimate is given the null distribution. But notice that we could also use the square of the $t$ statistic, which is\n$$ \nt^{2} = \frac{\left(\widehat{\beta}_{j} - b_{0}\right)^{2}}{\V[\widehat{\beta}_{j}]} = \frac{n\left(\widehat{\beta}_{j} - b_{0}\right)^{2}}{[\mb{V}_{\bfbeta}]_{jj}}. \n$$ {#eq-squared-t}\n\nWhile $|t|$ is the usual test statistic we use for two-sided tests, we could equivalently use $t^2$ and arrive at the exact same conclusions (as long as we knew the distribution of $t^2$ under the null hypothesis). It turns out that the $t^2$ version of the test statistic will generalize more easily to comparing multiple coefficients. This version of the test statistic suggests another general way to differentiate the null from the alternative: by taking the squared distance between them and dividing by the variance of the estimate. \n\nCan we generalize this idea to hypotheses about multiple parameters? Adding up the squared distances for each component of the null hypothesis is straightforward. 
For our interaction example, that would be\n$$ \n\\widehat{\\beta}_1^2 + \\widehat{\\beta}_3^2, \n$$\nRemember, however, that some of the estimated coefficients are noisier than others, so we should account for the uncertainty just like we did for the $t$-statistic. \n\nWith multiple parameters and multiple coefficients, the variances will now require matrix algebra. We can write any hypothesis about linear functions of the coefficients as $H_{0}: \\mb{L}\\bfbeta = \\mb{c}$. For example, in the interaction case, we have\n$$ \n\\mb{L} =\n\\begin{pmatrix}\n 0 & 1 & 0 & 0 \\\\\n 0 & 0 & 0 & 1 \\\\\n\\end{pmatrix}\n\\qquad\n\\mb{c} =\n\\begin{pmatrix}\n 0 \\\\\n 0\n\\end{pmatrix}\n$$\nThus, $\\mb{L}\\bfbeta = \\mb{0}$ is equivalent to $\\beta_1 = 0$ and $\\beta_3 = 0$. Notice that with other $\\mb{L}$ matrices, we could represent more complicated hypotheses like $2\\beta_1 - \\beta_2 = 34$, though we mostly stick to simpler functions. Let $\\widehat{\\bs{\\theta}} = \\mb{L}\\bhat$ be the OLS estimate of the function of the coefficients. By the delta method (discussed in @sec-delta-method), we have\n$$ \n\\sqrt{n}\\left(\\mb{L}\\bhat - \\mb{L}\\bfbeta\\right) \\indist \\N(0, \\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}').\n$$\nWe can now generalize the squared $t$ statistic in @eq-squared-t by taking the distances $\\mb{L}\\bhat - \\mb{c}$ weighted by the variance-covariance matrix $\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}'$, \n$$ \nW = n(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}')^{-1}(\\mb{L}\\bhat - \\mb{c}),\n$$\nwhich is called the **Wald test statistic**. This statistic generalizes the ideas of the t-statistic to multiple parameters. With the t-statistic, we recenter to have mean 0 and divide by the standard error to get a variance of 1. If we ignore the middle variance weighting, we have $(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\bhat - \\mb{c})$ which is just the sum of the squared deviations of the estimates from the null. Including the $(\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}')^{-1}$ weight has the effect of rescaling the distribution of $\\mb{L}\\bhat - \\mb{c}$ to make it rotationally symmetric around 0 (so the resulting dimensions are uncorrelated) with each dimension having an equal variance of 1. In this way, the Wald statistic transforms the random vectors to be mean-centered and have variance 1 (just the t-statistic), but also to have the resulting random variables in the vector be uncorrelated.[^norms]\n\n\n[^norms]: The form of the Wald statistic is that of a weighted inner product, $\\mb{x}'\\mb{Ay}$, where $\\mb{A}$ is a symmetric positive-definite weighting matrix. \n\nWhy transform the data in this way? @fig-wald shows the contour plot of a hypothetical joint distribution of two coefficients from an OLS regression. We might want to know the distance between different points in the distribution and the mean, which in this case is $(1, 2)$. Without considering the joint distribution, the circle is obviously closer to the mean than the triangle. However, looking at the two points on the distribution, the circle is at a lower contour than the triangle, meaning it is more extreme than the triangle for this particular distribution. The Wald statistic, then, takes into consideration how much of a \"climb\" it is for $\\mb{L}\\bhat$ to get to $\\mb{c}$ given the distribution of $\\mb{L}\\bhat$.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of two slope coefficients. 
The circle is closer to the center of the distribution by the standard Euclidean distance, but the triangle is closer once you consider the joint distribution.](ols_properties_files/figure-pdf/fig-wald-1.pdf){#fig-wald}\n:::\n:::\n\n\n\n\nIf $\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$ statistic, $W = t^2$. This fact will help us think about the asymptotic distribution of $W$. Note that as $n\to\infty$, we know that by the asymptotic normality of $\bhat$,\n$$ \nt = \frac{\widehat{\beta}_{j} - \beta_{j}}{\widehat{\se}[\widehat{\beta}_{j}]} \indist \N(0,1)\n$$\nso $t^2$ will converge in distribution to a $\chi^2_1$ (since a $\chi^2_1$ distribution is just one standard normal distribution squared). After recentering and rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\mb{L}\bfbeta = \mb{c}$, we have $W \indist \chi^2_{q}$. \n\n\nWe need to define the rejection region to use the Wald statistic in a hypothesis test. Because $W$ is a sum of squared, standardized distances, we have $W \geq 0$, and larger values of $W$ indicate more disagreement with the null in either direction. Thus, for an $\alpha$-level test of the joint null, we only need a one-sided rejection region of the form $\P(W > w_{\alpha}) = \alpha$. Obtaining these values is straightforward (see the callout tip below). For $q = 2$ and $\alpha = 0.05$, the critical value is roughly 6. \n\n\n\n::: {.callout-note}\n\n## Chi-squared critical values\n\nWe can obtain critical values for the $\chi^2_q$ distribution using the `qchisq()` function in R. For example, if we wanted to obtain the critical value $w_{\alpha}$ such that $\P(W > w_{\alpha}) = \alpha$ for our two-parameter interaction example, we could use:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqchisq(p = 0.95, df = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5.991465\n```\n:::\n:::\n\n\n:::\n\n\nThe Wald statistic is not a common test provided by standard statistical software functions like `lm()` in R, though it is fairly straightforward to implement \"by hand.\" Alternatively, packages like [`{aod}`](https://cran.r-project.org/web/packages/aod/index.html) or [`{clubSandwich}`](http://jepusto.github.io/clubSandwich/) have implementations of the test. What is reported by most software implementations of OLS (like `lm()` in R) is the F-statistic, which is\n$$ \nF = \frac{W}{q}.\n$$\nThis also typically uses the homoskedastic variance estimator $\widehat{\mb{V}}^{\texttt{lm}}_{\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution has no justification in statistical theory, but it is slightly more conservative than the $\chi^2_q$ distribution, and the inferences from the $F$ statistic will converge to those from the $\chi^2_q$ distribution as $n\to\infty$. So it might be justified as an *ad hoc* small-sample adjustment to the Wald test. 
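\n\n::: {.callout-tip}\n\n## A Wald test by hand\n\nTo make these pieces concrete, this sketch (simulated data; the numbers and variable names are our own, and we generate the data so that the null is true) computes $W$ for the interaction example, $H_0: \beta_1 = \beta_3 = 0$, using the HC variance estimator. Because it uses the estimated variance of $\bhat$ (that is, $\widehat{\mb{V}}_{\bfbeta}/n$), the factor of $n$ in the definition of $W$ is already absorbed.\n\n```r\nset.seed(7)\nn <- 300\nx <- rnorm(n)\nz <- rbinom(n, 1, 0.5)\ny <- 2 + 0.5 * z + rnorm(n)        # beta_1 = beta_3 = 0, so the null holds\nfit <- lm(y ~ x + z + x:z)\nb <- coef(fit)                     # (intercept, x, z, x:z)\nX <- model.matrix(fit)\nehat <- resid(fit)\nXtX_inv <- solve(crossprod(X))\nV_hat <- XtX_inv %*% crossprod(X * ehat) %*% XtX_inv  # HC variance of bhat\nL <- rbind(c(0, 1, 0, 0),\n           c(0, 0, 0, 1))\ndev <- L %*% b                     # L * bhat minus c, with c = (0, 0)\nW <- drop(t(dev) %*% solve(L %*% V_hat %*% t(L)) %*% dev)\nc(W = W, critical_value = qchisq(0.95, df = 2), p_value = 1 - pchisq(W, df = 2))\n```\n\n:::\n\n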
For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and we have, say, a sample size of $n = 100$, then in that case, the critical value for the F test with $\\alpha = 0.05$ is\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqf(0.95, df1 = 2, df2 = 100 - 4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.091191\n```\n:::\n:::\n\n\n\nThis result implies a critical value of 6.182 on the scale of the Wald statistic (multiplying it by $q = 2$). Compared to the earlier critical value of 5.991 based on the $\\chi^2_2$ distribution, we can see that the inferences will be very similar even in moderately-sized datasets. \n\nFinally, note that the F-statistic reported by `lm()` in R is the test of all the coefficients being equal to 0 jointly except for the intercept. In modern quantitative social sciences, this test is seldom substantively interesting. \n\n\n## Finite-sample properties with a linear CEF\n\nAll the above results have been large-sample properties, and we have not addressed finite-sample properties like the sampling variance or unbiasedness. Under the linear projection assumption above, OLS is generally biased without stronger assumptions. This section introduces the stronger assumption that will allow us to establish stronger properties for OLS. As usual, however, remember that these stronger assumptions can be wrong. \n\n\n::: {.callout-note}\n## Assumption: Linear Regression Model\n1. The variables $(Y_{i}, \\X_{i})$ satisfy the linear CEF assumption.\n$$ \n\\begin{aligned}\n Y_{i} &= \\X_{i}'\\bfbeta + e_{i} \\\\\n \\E[e_{i}\\mid \\X_{i}] & = 0.\n\\end{aligned}\n$$\n\n2. The design matrix is invertible $\\E[\\X_{i}\\X_{i}'] > 0$ (positive definite).\n:::\n\n\nWe discussed the concept of a linear CEF extensively in @sec-regression. However, recall that the CEF might be linear mechanically if the model is **saturated** or when there are as many coefficients in the model as there are unique values of $\\X_i$. When a model is not saturated, the linear CEF assumption is just that: an assumption. What can this assumption do? It can aid in establishing some nice statistical properties in finite samples. \n\nBefore proceeding, note that, when focusing on the finite sample inference for OLS, we focused on its properties **conditional on the observed covariates**, such as $\\E[\\bhat \\mid \\Xmat]$ or $\\V[\\bhat \\mid \\Xmat]$. The historical reason for this is that the researcher often chose these independent variables and so they were not random. Thus, sometimes $\\Xmat$ is treated as \"fixed\" in some older texts, which might even omit explicit conditioning statements. \n\n\n::: {#thm-ols-unbiased}\n\nUnder the linear regression model assumption, OLS is unbiased for the population regression coefficients, \n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta,\n$$\nand its conditional sampling variance is\n$$\n\\mb{\\V}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nwhere $\\sigma^2_{i} = \\E[e_{i}^{2} \\mid \\Xmat]$. 
\n:::\n\n\n::: {.proof}\n\nTo prove the conditional unbiasedness, recall that we can write the OLS estimator as\n$$\n\bhat = \bfbeta + (\Xmat'\Xmat)^{-1}\Xmat'\mb{e},\n$$\nand so taking (conditional) expectations, we have\n$$\n\E[\bhat \mid \Xmat] = \bfbeta + \E[(\Xmat'\Xmat)^{-1}\Xmat'\mb{e} \mid \Xmat] = \bfbeta + (\Xmat'\Xmat)^{-1}\Xmat'\E[\mb{e} \mid \Xmat] = \bfbeta,\n$$\nbecause under the linear CEF assumption $\E[\mb{e}\mid \Xmat] = 0$. \n\nFor the conditional sampling variance, we can use the same decomposition to write\n$$\n\V[\bhat \mid \Xmat] = \V[\bfbeta + (\Xmat'\Xmat)^{-1}\Xmat'\mb{e} \mid \Xmat] = (\Xmat'\Xmat)^{-1}\Xmat'\V[\mb{e} \mid \Xmat]\Xmat(\Xmat'\Xmat)^{-1}. \n$$\nSince $\E[\mb{e}\mid \Xmat] = 0$, we know that $\V[\mb{e}\mid \Xmat] = \E[\mb{ee}' \mid \Xmat]$, which is a matrix with diagonal entries $\E[e_{i}^{2} \mid \Xmat] = \sigma^2_i$ and off-diagonal entries $\E[e_{i}e_{j} \mid \Xmat] = \E[e_{i}\mid \Xmat]\E[e_{j}\mid\Xmat] = 0$, where the first equality follows from the independence of the errors across units. Thus, $\V[\mb{e} \mid \Xmat]$ is a diagonal matrix with $\sigma^2_i$ along the diagonal, which means\n$$\n\Xmat'\V[\mb{e} \mid \Xmat]\Xmat = \sum_{i=1}^n \sigma^2_i \X_i\X_i',\n$$\nestablishing the conditional sampling variance.\n \n:::\n\nThis means that, for any realization of the covariates, $\Xmat$, OLS is unbiased for the true regression coefficients $\bfbeta$. By the law of iterated expectation, we also know that it is unconditionally unbiased[^unconditional] as well since\n$$\n\E[\bhat] = \E[\E[\bhat \mid \Xmat]] = \bfbeta. \n$$\nThe difference between these two statements usually isn't incredibly meaningful. \n\n[^unconditional]: We are basically ignoring some edge cases when it comes to discrete covariates here. In particular, we assume that $\Xmat'\Xmat$ is nonsingular with probability one. However, this assumption can fail if we have a binary covariate since there is some chance (however slight) that the entire column will be all ones or all zeros, which would lead to a singular matrix $\Xmat'\Xmat$. Practically this is not a big deal, but it does mean that we have to ignore this issue theoretically or focus on conditional unbiasedness. \n\n\nThere are a lot of variances flying around, so reviewing them is helpful. Above, we derived the asymptotic variance of $\mb{Z}_{n} = \sqrt{n}(\bhat - \bfbeta)$, \n$$\n\mb{V}_{\bfbeta} = \left( \E[\X_i\X_i'] \right)^{-1}\E[e_i^2\X_i\X_i']\left( \E[\X_i\X_i'] \right)^{-1},\n$$\nwhich implies that the approximate variance of $\bhat$ will be $\mb{V}_{\bfbeta} / n$ because\n$$\n\bhat = \frac{\mb{Z}_n}{\sqrt{n}} + \bfbeta \quad\implies\quad \bhat \overset{a}{\sim} \N(\bfbeta, n^{-1}\mb{V}_{\bfbeta}),\n$$\nwhere $\overset{a}{\sim}$ means asymptotically distributed as. Under the linear CEF, the conditional sampling variance of $\bhat$ takes a similar sandwich form,\n$$\n\mb{V}_{\bhat} = \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2_i \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \approx \mb{V}_{\bfbeta} / n.\n$$\nIn practice, these two derivations lead to basically the same variance estimator. 
Recall that the heteroskedastic-consistent variance estimator\n$$\n\widehat{\mb{V}}_{\bfbeta} = \left( \frac{1}{n} \Xmat'\Xmat \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^n\widehat{e}_i^2\X_i\X_i' \right) \left( \frac{1}{n} \Xmat'\Xmat \right)^{-1},\n$$\nis a valid plug-in estimator for the asymptotic variance and\n$$\n\widehat{\mb{V}}_{\bhat} = n^{-1}\widehat{\mb{V}}_{\bfbeta}.\n$$\nThus, in practice, the asymptotic and finite-sample results under a linear CEF justify the same variance estimator. \n\n\n### Linear CEF model under homoskedasticity\n\nIf we are willing to assume that the errors are homoskedastic, we can derive even stronger results for OLS. Stronger assumptions typically lead to stronger conclusions, but, obviously, those conclusions may not be robust to assumption violations. But homoskedasticity of errors is such a historically important assumption that statistical software implementations of OLS like `lm()` in R assume it by default. \n\n::: {.callout-note}\n\n## Assumption: Homoskedasticity with a linear CEF\n\nIn addition to the linear CEF assumption, we further assume that\n$$\n\E[e_i^2 \mid \X_i] = \E[e_i^2] = \sigma^2,\n$$\nor that the variance of the errors does not depend on the covariates. \n:::\n\n\n::: {#thm-homoskedasticity}\n\nUnder a linear CEF model with homoskedastic errors, the conditional sampling variance is\n$$\n\mb{V}^{\texttt{lm}}_{\bhat} = \V[\bhat \mid \Xmat] = \sigma^2 \left( \Xmat'\Xmat \right)^{-1},\n$$\nand the variance estimator \n$$\n\widehat{\mb{V}}^{\texttt{lm}}_{\bhat} = \widehat{\sigma}^2 \left( \Xmat'\Xmat \right)^{-1} \quad\text{where}\quad \widehat{\sigma}^2 = \frac{1}{n - k - 1} \sum_{i=1}^n \widehat{e}_i^2\n$$\nis unbiased, $\E[\widehat{\mb{V}}^{\texttt{lm}}_{\bhat} \mid \Xmat] = \mb{V}^{\texttt{lm}}_{\bhat}$. \n::: \n\n::: {.proof}\nUnder homoskedasticity $\sigma^2_i = \sigma^2$ for all $i$. Recall that $\sum_{i=1}^n \X_i\X_i' = \Xmat'\Xmat$. Thus, the conditional sampling variance from @thm-ols-unbiased simplifies to\n$$ \n\begin{aligned}\n\V[\bhat \mid \Xmat] &= \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2 \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \\ &= \sigma^2\left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \\&= \sigma^2\left( \Xmat'\Xmat \right)^{-1}\left( \Xmat'\Xmat \right) \left( \Xmat'\Xmat \right)^{-1} \\&= \sigma^2\left( \Xmat'\Xmat \right)^{-1} = \mb{V}^{\texttt{lm}}_{\bhat}.\n\end{aligned}\n$$\n\nFor unbiasedness, we just need to show that $\E[\widehat{\sigma}^{2} \mid \Xmat] = \sigma^2$. Recall that we defined $\mb{M}_{\Xmat}$ as the residual-maker because $\mb{M}_{\Xmat}\mb{Y} = \widehat{\mb{e}}$. We can use this to connect the residuals to the errors,\n$$ \n\mb{M}_{\Xmat}\mb{e} = \mb{M}_{\Xmat}\mb{Y} - \mb{M}_{\Xmat}\Xmat\bfbeta = \mb{M}_{\Xmat}\mb{Y} = \widehat{\mb{e}},\n$$ \nso \n$$\n\V[\widehat{\mb{e}} \mid \Xmat] = \mb{M}_{\Xmat}\V[\mb{e} \mid \Xmat]\mb{M}_{\Xmat}' = \sigma^2\mb{M}_{\Xmat},\n$$\nwhere the first equality holds because $\mb{M}_{\Xmat} = \mb{I}_{n} - \Xmat (\Xmat'\Xmat)^{-1} \Xmat'$ is constant conditional on $\Xmat$, and the second because $\V[\mb{e} \mid \Xmat] = \sigma^2\mb{I}_{n}$ under homoskedasticity and $\mb{M}_{\Xmat}$ is symmetric and idempotent. 
Notice that the diagonal entries of this matrix are the variances of particular residuals $\widehat{e}_i$ and that the diagonal entries of the annihilator matrix are $1 - h_{ii}$ (since the $h_{ii}$ are the diagonal entries of $\mb{P}_{\Xmat}$). Thus, we have\n$$ \n\V[\widehat{e}_i \mid \Xmat] = \E[\widehat{e}_{i}^{2} \mid \Xmat] = (1 - h_{ii})\sigma^{2}.\n$$\nIn the last chapter, in @sec-leverage, we established that one property of these leverage values is $\sum_{i=1}^n h_{ii} = k + 1$, so $\sum_{i=1}^n (1 - h_{ii}) = n - k - 1$ and we have\n$$ \n\begin{aligned}\n \E[\widehat{\sigma}^{2} \mid \Xmat] &= \frac{1}{n-k-1} \sum_{i=1}^{n} \E[\widehat{e}_{i}^{2} \mid \Xmat] \\\n &= \frac{\sigma^{2}}{n-k-1} \sum_{i=1}^{n} (1 - h_{ii}) \\\n &= \sigma^{2}. \n\end{aligned}\n$$\nThis establishes $\E[\widehat{\mb{V}}^{\texttt{lm}}_{\bhat} \mid \Xmat] = \mb{V}^{\texttt{lm}}_{\bhat}$. \n\n:::\n\n\nThus, under the linear CEF model and homoskedasticity of the errors, we have an unbiased variance estimator that is a simple function of the sum of squared residuals and the design matrix. Most statistical software packages estimate standard errors using $\widehat{\mb{V}}^{\texttt{lm}}_{\bhat}$. \n\n\nThe final result we can derive for the linear CEF under the homoskedasticity assumption is an optimality result. That is, we might ask if there is another estimator for $\bfbeta$ that would outperform OLS in the sense of having a lower sampling variance. Perhaps surprisingly, no other linear unbiased estimator for $\bfbeta$ has a lower conditional variance, meaning that OLS is the **best linear unbiased estimator**, often jovially shortened to BLUE. This result is famously known as the Gauss-Markov Theorem.\n\n::: {#thm-gauss-markov}\n\nLet $\widetilde{\bfbeta} = \mb{AY}$ be a linear and unbiased estimator for $\bfbeta$. Under the linear CEF model with homoskedastic errors, \n$$\n\V[\widetilde{\bfbeta}\mid \Xmat] \geq \V[\bhat \mid \Xmat]. \n$$\n\n:::\n\n::: {.proof}\nNote that if $\widetilde{\bfbeta}$ is unbiased then $\E[\widetilde{\bfbeta} \mid \Xmat] = \bfbeta$ and so \n$$\n\bfbeta = \E[\mb{AY} \mid \Xmat] = \mb{A}\E[\mb{Y} \mid \Xmat] = \mb{A}\Xmat\bfbeta,\n$$\nwhich must hold for every value of $\bfbeta$ and so implies that $\mb{A}\Xmat = \mb{I}_{k+1}$. \nRewrite the competitor as $\widetilde{\bfbeta} = \bhat + \mb{BY}$ where\n$$ \n\mb{B} = \mb{A} - \left(\Xmat'\Xmat\right)^{-1}\Xmat',\n$$\nand note that $\mb{A}\Xmat = \mb{I}_{k+1}$ implies that $\mb{B}\Xmat = 0$. 
We now have\n$$ \n\begin{aligned}\n \widetilde{\bfbeta} &= \left( \left(\Xmat'\Xmat\right)^{-1}\Xmat' + \mb{B}\right)\mb{Y} \\\n &= \left( \left(\Xmat'\Xmat\right)^{-1}\Xmat' + \mb{B}\right)\Xmat\bfbeta + \left( \left(\Xmat'\Xmat\right)^{-1}\Xmat' + \mb{B}\right)\mb{e} \\\n &= \bfbeta + \mb{B}\Xmat\bfbeta + \left( \left(\Xmat'\Xmat\right)^{-1}\Xmat' + \mb{B}\right)\mb{e} \\\n &= \bfbeta + \left( \left(\Xmat'\Xmat\right)^{-1}\Xmat' + \mb{B}\right)\mb{e}.\n\end{aligned}\n$$\nThe variance of the competitor is, thus, \n$$ \n\begin{aligned}\n \V[\widetilde{\bfbeta} \mid \Xmat]\n &= \left( \left(\Xmat'\Xmat\right)^{-1}\Xmat' + \mb{B}\right)\V[\mb{e}\mid \Xmat]\left( \left(\Xmat'\Xmat\right)^{-1}\Xmat' + \mb{B}\right)' \\\n &= \sigma^{2}\left( \left(\Xmat'\Xmat\right)^{-1}\Xmat' + \mb{B}\right)\left( \Xmat\left(\Xmat'\Xmat\right)^{-1} + \mb{B}'\right) \\\n &= \sigma^{2}\left(\left(\Xmat'\Xmat\right)^{-1}\Xmat'\Xmat\left(\Xmat'\Xmat\right)^{-1} + \left(\Xmat'\Xmat\right)^{-1}\Xmat'\mb{B}' + \mb{B}\Xmat\left(\Xmat'\Xmat\right)^{-1} + \mb{BB}'\right)\\\n &= \sigma^{2}\left(\left(\Xmat'\Xmat\right)^{-1} + \mb{BB}'\right)\\\n &\geq \sigma^{2}\left(\Xmat'\Xmat\right)^{-1} \\\n &= \V[\bhat \mid \Xmat].\n\end{aligned}\n$$\nThe first equality comes from the properties of covariance matrices, the second is due to the homoskedasticity assumption, and the fourth is due to $\mb{B}\Xmat = 0$, which implies that $\Xmat'\mb{B}' = 0$ as well. The inequality in the fifth line holds because matrices of the form $\mb{BB}'$ are always positive semi-definite. \n\n:::\n\nIn this proof, we saw that the competing estimator had variance $\sigma^2\left(\left(\Xmat'\Xmat\right)^{-1} + \mb{BB}'\right)$, which exceeds the variance of OLS by $\sigma^2\mb{BB}'$, a matrix that is \"greater than or equal to 0\" in the matrix sense, also known as positive semi-definite. What does this mean practically? Remember that any positive semi-definite matrix must have nonnegative diagonal entries and that the diagonal entries of $\V[\bhat \mid \Xmat]$ and $\V[\widetilde{\bfbeta}\mid \Xmat]$ are the variances of the individual parameters, $\V[\widehat{\beta}_{j} \mid \Xmat]$ and $\V[\widetilde{\beta}_{j} \mid \Xmat]$. Thus, the variances of the individual parameters will be at least as large for $\widetilde{\bfbeta}$ as for $\bhat$.\n\nMany textbooks cite the Gauss-Markov theorem as a critical advantage of OLS over other methods, but recognizing its limitations is essential. It requires linearity and homoskedastic error assumptions, and these can be false in many applications. \n\nFinally, note that while we have shown this result for linear estimators, @Hansen22 proves a more general version of this result that applies to any unbiased estimator. \n\n## The normal linear model\n\nFinally, we add the strongest and thus least loved of the classical linear regression assumptions: (conditional) normality of the errors. Historically, the reason to use this assumption was that finite-sample inference hits a roadblock without some knowledge of the sampling distribution of $\bhat$. Under the linear CEF model, we saw that $\bhat$ is unbiased, and under homoskedasticity, we could produce an unbiased estimator of the conditional variance. 
But for hypothesis testing or for generating confidence intervals, we need to make probability statements about the estimator, and, for that, we need to know its exact distribution. When the sample size is large, we can rely on the CLT and know $\bhat$ is approximately normal. But how do we proceed in small samples? Historically, we would have assumed (conditional) normality of the errors, basically proceeding with some knowledge that we were wrong but hopefully not too wrong. \n\n\n::: {.callout-note}\n\n## The normal linear regression model\n\nIn addition to the linear CEF assumption, we assume that \n$$\ne_i \mid \Xmat \sim \N(0, \sigma^2).\n$$\n\n:::\n\nThere are a couple of important points: \n\n- The assumption here is not that $(Y_{i}, \X_{i})$ are jointly normal (though this would be sufficient for the assumption to hold), but rather that $Y_i$ is normally distributed conditional on $\X_i$. \n- Notice that the normal regression model has the homoskedasticity assumption baked in. \n\n::: {#thm-normal-ols}\n\nUnder the normal linear regression model, we have\n$$ \n\begin{aligned}\n \bhat \mid \Xmat &\sim \N\left(\bfbeta, \sigma^{2}\left(\Xmat'\Xmat\right)^{-1}\right) \\\n \frac{\widehat{\beta}_{j} - \beta_{j}}{\sqrt{[\widehat{\mb{V}}^{\texttt{lm}}_{\bhat}]_{jj}}} &\sim t_{n-k-1} \\\n W/q &\sim F_{q, n-k-1}. \n\end{aligned}\n$$\n\n:::\n\n\nThis theorem says that in the normal linear regression model, the coefficients follow a normal distribution, the t-statistics follow a $t$-distribution, and a transformation of the Wald statistic follows an $F$ distribution. These are **exact** results and do not rely on large-sample approximations. Under the assumption of conditional normality of the errors, the results are as valid for $n = 5$ as for $n = 500,000$. \n\nFew people believe errors follow a normal distribution, so why even present these results? Unfortunately, most statistical software implementations of OLS implicitly assume this when calculating p-values for tests or constructing confidence intervals. In R, for example, the p-value associated with the $t$-statistic reported by `lm()` relies on the $t_{n-k-1}$ distribution, and the critical values used to construct confidence intervals with `confint()` use that distribution as well. When normality does not hold, there is no principled reason to use the $t$ or the $F$ distributions in this way. But we might hold our nose and use this *ad hoc* procedure under two rationalizations:\n\n- $\bhat$ is asymptotically normal. This approximation might, however, be poor in smaller finite samples. The $t$ distribution will make inference more conservative in these cases (wider confidence intervals, smaller test rejection regions), which might help offset the poor quality of the normal approximation in small samples. \n- As $n\to\infty$, the $t_{n-k-1}$ will converge to a standard normal distribution, so the *ad hoc* adjustment will not matter much for medium to large samples. \n\nThese arguments are not very convincing since whether the $t$ approximation will be any better than the normal in finite samples is unclear. But it may be the best we can do while we go and find more data. \n\n## Summary\n\nIn this chapter, we discussed the large-sample properties of OLS, which are quite strong. Under mild conditions, OLS is consistent for the population linear regression coefficients and is asymptotically normal. 
The variance of the OLS estimator, and thus the variance estimator, depends on whether the projection errors are assumed to be unrelated to the covariates (**homoskedastic**) or possibly related (**heteroskedastic**). Confidence intervals and hypothesis tests for individual OLS coefficients are largely the same as discussed in Part I of this book, and we can obtain finite-sample properties of OLS such as conditional unbiasedness if we assume the conditional expectation function is linear. If we further assume the errors are normally distributed, we can derive confidence intervals and hypothesis tests that are valid for all sample sizes. \n", "supporting": [ "ols_properties_files/figure-pdf" ], diff --git a/_freeze/ols_properties/figure-pdf/fig-wald-1.pdf b/_freeze/ols_properties/figure-pdf/fig-wald-1.pdf index fa2102a92efb4a210923b931f4586c65a60e2d4d..77d260ea4d7ef42b4f525957f4d9baccf30714ba 100644 GIT binary patch delta 206 zcmaEPf%)YH<_+FARLzY{4GoNpjmE?1CTOl}V4`kdppHkM`DFEHkpe~v8X>6>3O<=-sR}@24JS`{CeCQQdBw9NMkfO^ s6H5bgM>A7PR~I)&X9E)hHz#8YR|^9}BTEZ&6C*nX8v;ru`@E0^0P_Vnd;kCd delta 206 zcmaEPf%)YH<_+FARLu-54b4qWO)WLK^nLSFToOxC6*OF|j0}uS4B&E`3vMLyGMY}l z@>E?1E~sl@qHbWIj!)lY^=FX+rV1J%sSyf3nPsU8Kx2(3Pk1KIXu5gDvm{0*V-pK! qM{\raggedright\arraybackslash}p{(\columnwidth - 4\tabcolsep) * \real{0.2466}} >{\raggedright\arraybackslash}p{(\columnwidth - 4\tabcolsep) * \real{0.3151}} >{\raggedright\arraybackslash}p{(\columnwidth - 4\tabcolsep) * \real{0.4384}}@{}} -\caption{All possible simple random samples of size 2 from the hobbit -population}\label{tbl-hobbit-samples}\tabularnewline +\caption{\label{tbl-hobbit-samples}All possible simple random samples of +size 2 from the hobbit population}\tabularnewline \toprule\noalign{} \begin{minipage}[b]{\linewidth}\raggedright Sample (\(j\)) @@ -898,10 +900,11 @@ \section{Question 4: Estimator}\label{question-4-estimator} 6 (Pip, Merry) & 1/6 & (123 + 127) / 2 = 125 \\ \end{longtable} +\hypertarget{tbl-sampling-dist}{} \begin{longtable}[]{@{}ll@{}} -\caption{Sampling distribution of the sample mean for simple random -samples of size 2 from the hobbit -population}\label{tbl-sampling-dist}\tabularnewline +\caption{\label{tbl-sampling-dist}Sampling distribution of the sample +mean for simple random samples of size 2 from the hobbit +population}\tabularnewline \toprule\noalign{} Sample mean & Probability \\ \midrule\noalign{} @@ -922,7 +925,7 @@ \section{Question 4: Estimator}\label{question-4-estimator} more or less likely and depends on both the population distribution and the sampling design. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Notice that the sampling distribution of an estimator will depend on the sampling design. Here, we used a simple random sample. 
Bernoulli @@ -933,8 +936,9 @@ \section{Question 4: Estimator}\label{question-4-estimator} \end{tcolorbox} +\hypertarget{properties-of-the-sampling-distribution-of-an-estimator}{% \subsection{Properties of the sampling distribution of an -estimator}\label{properties-of-the-sampling-distribution-of-an-estimator} +estimator}\label{properties-of-the-sampling-distribution-of-an-estimator}} Generally speaking, we want ``good'' estimators. But what makes an estimator ``good'\,'? The best estimator would obviously be the one that @@ -985,7 +989,7 @@ \subsection{Properties of the sampling distribution of an \] The two are the same, meaning the sample mean in this simple random sample is unbiased. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, breakable, colbacktitle=quarto-callout-warning-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-warning-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-warning-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-warning-color-frame] Note that the word ``bias'' sometimes also refers to research that is systematically incorrect in other ways. For example, we might complain @@ -1056,7 +1060,8 @@ \subsection{Properties of the sampling distribution of an An estimator can be very precise, but the same estimator can be inaccurate because it is biased. -\section{Question 5: Uncertainty}\label{question-5-uncertainty} +\hypertarget{question-5-uncertainty}{% +\section{Question 5: Uncertainty}\label{question-5-uncertainty}} We now have a population, a quantity of interest, a sampling design, an estimator, and, with data, an actual estimate. But if we sampled, say, @@ -1093,8 +1098,9 @@ \section{Question 5: Uncertainty}\label{question-5-uncertainty} larger the population size, \(N\), the smaller the sampling variance (again for a fixed sample size). +\hypertarget{deriving-the-sampling-variance-of-the-sample-mean}{% \subsection{Deriving the sampling variance of the sample -mean}\label{deriving-the-sampling-variance-of-the-sample-mean} +mean}\label{deriving-the-sampling-variance-of-the-sample-mean}} How did we obtain this expression for the sampling variance under simple random sampling? It would be tempting to simply say ``someone else @@ -1178,8 +1184,9 @@ \subsection{Deriving the sampling variance of the sample The next chapter will discuss how the variance of the sample mean under independent and identically distributed sampling is much simpler. +\hypertarget{estimating-the-sampling-variance}{% \subsection{Estimating the sampling -variance}\label{estimating-the-sampling-variance} +variance}\label{estimating-the-sampling-variance}} An unfortunate aspect of the sampling variance, \(\V[\Xbar_n]\), is that it depends on the population variance, \(s^2\), which we cannot know @@ -1197,7 +1204,7 @@ \subsection{Estimating the sampling \widehat{\V}[\Xbar_n] = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}. 
\] -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Mind your variances}, breakable, colbacktitle=quarto-callout-warning-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-warning-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Mind your variances}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-warning-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-warning-color-frame] It is easy to get confused about the difference between the population variance, the variance of the sample, and the sampling variance (just as @@ -1241,8 +1248,9 @@ \subsection{Estimating the sampling \E\left[\widehat{\V}[\Xbar_n]\right] = \left(1 - \frac{n}{N}\right)\frac{\E\left[S^2\right]}{n} = \left(1 - \frac{n}{N}\right)\frac{s^2}{n} = \V[\Xbar_n], \] which establishes that the estimator is unbiased. +\hypertarget{stratified-sampling-and-survey-weights}{% \section{Stratified sampling and survey -weights}\label{stratified-sampling-and-survey-weights} +weights}\label{stratified-sampling-and-survey-weights}} True to its name, the simple random sample is perhaps the most straightforward way to take a random sample of a fixed size. With more @@ -1374,7 +1382,8 @@ \section{Stratified sampling and survey considered the better estimator in many situations, though, because it has lower sampling variance than the HT estimator. -\subsection{Sampling weights}\label{sampling-weights} +\hypertarget{sampling-weights}{% +\subsection{Sampling weights}\label{sampling-weights}} The HT and Hajek estimators are both functions of what are commonly called the \textbf{sampling weights}, \[w_i = \frac{1}{\pi_i}\]. We can @@ -1408,7 +1417,8 @@ \subsection{Sampling weights}\label{sampling-weights} which is equivalent to the Hajek estimator above. -\section{Summary}\label{summary} +\hypertarget{summary}{% +\section{Summary}\label{summary}} This chapter covered the basic structure of design-based inference in the context of sampling from a population. We introduced the basic @@ -1424,9 +1434,11 @@ \section{Summary}\label{summary} design, choose a quantity of interest, select an estimator, and describe the uncertainty of any estimates. -\chapter{Model-based inference}\label{sec-model-based} +\hypertarget{sec-model-based}{% +\chapter{Model-based inference}\label{sec-model-based}} -\section{Introduction}\label{introduction-1} +\hypertarget{introduction-1}{% +\section{Introduction}\label{introduction-1}} Suppose you have been tasked with estimating the fraction of a population that supports increasing legal immigration limits. You have a @@ -1513,8 +1525,9 @@ \section{Introduction}\label{introduction-1} proofs helps us understand the arguments about novel estimators that we inevitably see over the course of our careers. 
+\hypertarget{probability-vs-inference-the-big-picture}{% \section{Probability vs inference: the big -picture}\label{probability-vs-inference-the-big-picture} +picture}\label{probability-vs-inference-the-big-picture}} Probability is the mathematical study of uncertain events and is the basis of the mathematical study of estimation. In probability, we assume @@ -1526,7 +1539,13 @@ \section{Probability vs inference: the big estimation, we use our observed data to make an \textbf{inference} about the data-generating process. -\includegraphics{assets/img/two-direction.png} +\begin{figure}[th] + +{\centering \includegraphics{assets/img/two-direction.png} + +} + +\end{figure} An estimator is a rule for converting our data into a best guess about some unknown quantity, such as the percent of balls in the urn, or, to @@ -1577,7 +1596,8 @@ \section{Probability vs inference: the big design-based inference, but adapt these to consider model-based inference. -\section{Question 1: Population}\label{question-1-population-1} +\hypertarget{question-1-population-1}{% +\section{Question 1: Population}\label{question-1-population-1}} The main advantage and disadvantage of relying on models is that they are abstract and theoretical, which means the connection between a model @@ -1602,8 +1622,9 @@ \section{Question 1: Population}\label{question-1-population-1} trying to learn about so that readers can evaluate how well the modeling assumptions fit that task. +\hypertarget{question-2-statistical-model}{% \section{Question 2: Statistical -model}\label{question-2-statistical-model} +model}\label{question-2-statistical-model}} Let's begin by building a bare-bones probability model for how our data came to be. As an example, suppose we have a data set with a series of @@ -1665,7 +1686,7 @@ \section{Question 2: Statistical \(F\) if \(\{X_1, \ldots, X_n\}\) is iid with distribution \(F\). The \textbf{sample size} \(n\) is the number of units in the sample. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] You might wonder why we reference the distribution of \(X_i\) with the cdf, \(F\). Mathematical statistics tends to do this to avoid having to @@ -1747,7 +1768,7 @@ \section{Question 2: Statistical \(F\) becomes the joint distribution of that random vector. Nothing substantive changes about the above discussion. 
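To make the iid sampling model concrete, here is a minimal simulation sketch. The distribution \(F\), the sample size, and the Bernoulli success probability are arbitrary choices for illustration; nothing in the model depends on them.

\begin{verbatim}
## A minimal sketch of the iid model: n draws from a common distribution F.
## Here F is arbitrarily taken to be Bernoulli(0.6), standing in for a
## binary survey response; any distribution would do.
set.seed(1234)                         # reproducibility
n <- 100                               # sample size (illustrative)
x <- rbinom(n, size = 1, prob = 0.6)   # X_1, ..., X_n iid draws from F

head(x)    # the observed data
mean(x)    # one summary of this particular sample
\end{verbatim}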
-\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, breakable, colbacktitle=quarto-callout-warning-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-warning-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-warning-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-warning-color-frame] Survey sampling is one of the most popular ways of obtaining samples from a population, but modern sampling practices rarely produce a @@ -1783,8 +1804,9 @@ \section{Question 2: Statistical \end{tcolorbox} +\hypertarget{question-3-quantities-of-interest}{% \section{Question 3: Quantities of -interest}\label{question-3-quantities-of-interest} +interest}\label{question-3-quantities-of-interest}} In model-based inference, our goal is to learn about the data-generating process. Each data point \(X_i\) represents a draw from a distribution, @@ -1850,14 +1872,15 @@ \section{Question 3: Quantities of estimation} describes how we obtain a single ``best guess'' about \(\theta\). -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Some refer to quantities of interest as \textbf{parameters} or \textbf{estimands} (that is, the target of estimation). \end{tcolorbox} -\section{Question 4: Estimator}\label{question-4-estimator-1} +\hypertarget{question-4-estimator-1}{% +\section{Question 4: Estimator}\label{question-4-estimator-1}} Having a target in mind, we can estimate it with our data. 
To do so, we first need a rule or algorithm or function that takes as inputs the data @@ -1880,7 +1903,7 @@ \section{Question 4: Estimator}\label{question-4-estimator-1} \end{definition} -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] It is widespread, though not universal, to use the ``hat'' notation to define an estimator and its estimand. For example, \(\widehat{\theta}\) @@ -1913,7 +1936,13 @@ \section{Question 4: Estimator}\label{question-4-estimator-1} For example, here we illustrate two samples of size \(n =5\) from the population distribution of a binary variable: -\includegraphics{assets/img/sampling-distribution.png} +\begin{figure}[th] + +{\centering \includegraphics{assets/img/sampling-distribution.png} + +} + +\end{figure} We can see that the mean of the variable depends on what exact values end up in our sample. We refer to the distribution of @@ -1921,7 +1950,7 @@ \section{Question 4: Estimator}\label{question-4-estimator-1} distribution}. The sampling distribution of an estimator will be the basis for all of the formal statistical properties of an estimator. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, breakable, colbacktitle=quarto-callout-warning-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-warning-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-warning-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-warning-color-frame] One important distinction of jargon is between an estimator and an estimate. The estimator is a function of the data, whereas the @@ -1938,7 +1967,8 @@ \section{Question 4: Estimator}\label{question-4-estimator-1} \end{tcolorbox} -\section{How to find estimators}\label{how-to-find-estimators} +\hypertarget{how-to-find-estimators}{% +\section{How to find estimators}\label{how-to-find-estimators}} Where do estimators come from? That may seem like a question reserved for statisticians or methodologists or others responsible for @@ -1949,8 +1979,9 @@ \section{How to find estimators}\label{how-to-find-estimators} models, before turning to the main focus of this book, plug-in estimators. 
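Before doing so, a small sketch may help fix the ideas from this section: an estimator is literally a function that maps a data set to a single number, and simulating repeated samples traces out its sampling distribution. The population values below are invented purely for illustration.

\begin{verbatim}
## An estimator is a rule that converts data into a guess. The sample mean,
## written as an R function:
theta_hat <- function(x) mean(x)

set.seed(42)
## One sample of size n = 5 from a Bernoulli(0.4) population (arbitrary values)
x <- rbinom(5, size = 1, prob = 0.4)
theta_hat(x)   # the estimate produced by this particular sample

## Repeating the sampling step many times approximates the sampling
## distribution of the estimator
estimates <- replicate(5000, theta_hat(rbinom(5, size = 1, prob = 0.4)))
table(estimates) / length(estimates)   # how often each estimate occurs
\end{verbatim}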
+\hypertarget{parametric-models-and-maximum-likelihood}{% \subsection{Parametric models and maximum -likelihood}\label{parametric-models-and-maximum-likelihood} +likelihood}\label{parametric-models-and-maximum-likelihood}} The first method for generating estimators relies on \textbf{parametric models}, in which the researcher specifies the exact distribution (up to @@ -1978,7 +2009,7 @@ \subsection{Parametric models and maximum Binomial? The attractive properties of MLE are only as good as the ability to specify the parametric model. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{No free lunch}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{No free lunch}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Building up intuition about the \textbf{assumptions-precision tradeoff} is essential. Researchers can usually get more precise estimates if they @@ -1988,7 +2019,8 @@ \subsection{Parametric models and maximum \end{tcolorbox} -\subsection{Plug-in estimators}\label{plug-in-estimators} +\hypertarget{plug-in-estimators}{% +\subsection{Plug-in estimators}\label{plug-in-estimators}} The second broad class of estimators is \textbf{semiparametric} in that we specify some finite-dimensional parameters of the DGP but leave the @@ -2062,7 +2094,7 @@ \subsection{Plug-in estimators}\label{plug-in-estimators} \end{example} -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Notation alert}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Notation alert}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Given the connection between the population mean and the sample mean, you may see the \(\E_n[\cdot]\) operator used as a shorthand for the @@ -2083,8 +2115,9 @@ \subsection{Plug-in estimators}\label{plug-in-estimators} complete and technical treatment of these ideas, see Wasserman (2004) Chapter 7. 
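As a quick illustration of the plug-in principle before moving on (with an arbitrary simulated sample), suppose the quantity of interest is the population variance, \(\V[X] = \E[X^2] - \E[X]^2\). The plug-in estimator simply swaps each population expectation for the corresponding sample mean.

\begin{verbatim}
## Plug-in estimation: replace population expectations E[.] with sample
## means E_n[.]. Target here: the variance V[X] = E[X^2] - (E[X])^2.
set.seed(7)
x <- rnorm(500, mean = 1, sd = 2)   # illustrative iid sample; true variance is 4

mean(x^2) - mean(x)^2   # plug-in estimate of the variance (n denominator)

mean(x > 2)             # plug-in estimate of another functional, P(X > 2)
\end{verbatim}

Note that base R's \texttt{var()} divides by \(n - 1\) rather than \(n\), so it differs slightly from the pure plug-in version in small samples.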
+\hypertarget{the-three-distributions-population-empirical-and-sampling}{% \section{The three distributions: population, empirical, and -sampling}\label{the-three-distributions-population-empirical-and-sampling} +sampling}\label{the-three-distributions-population-empirical-and-sampling}} Once we start to wade into estimation, there are several distributions to keep track of, and things can quickly become confusing. Three @@ -2185,8 +2218,9 @@ \section{The three distributions: population, empirical, and \end{example} +\hypertarget{finite-sample-properties-of-estimators}{% \section{Finite-sample properties of -estimators}\label{finite-sample-properties-of-estimators} +estimators}\label{finite-sample-properties-of-estimators}} As discussed in our introduction to estimators, their usefulness depends on how well they help us learn about the quantity of interest. If we get @@ -2206,7 +2240,8 @@ \section{Finite-sample properties of the sampling distribution that are useful in comparing estimators. Note that the properties here will be very similar to the -\subsection{Bias}\label{bias} +\hypertarget{bias}{% +\subsection{Bias}\label{bias}} The first property of the sampling distribution concerns its central tendency. In particular, we define the \textbf{bias} (or @@ -2240,7 +2275,7 @@ \subsection{Bias}\label{bias} \end{example} -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, breakable, colbacktitle=quarto-callout-warning-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-warning-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-warning-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-warning-color-frame] Properties like unbiasedness might only hold for a subset of DGPs. For example, we just showed that the sample mean is unbiased but only when @@ -2253,7 +2288,8 @@ \subsection{Bias}\label{bias} \end{tcolorbox} -\section{Question 5: Uncertainty}\label{question-5-uncertainty-1} +\hypertarget{question-5-uncertainty-1}{% +\section{Question 5: Uncertainty}\label{question-5-uncertainty-1}} The spread of the sampling distribution is also important. We define the \textbf{sampling variance} as the variance of an estimator's sampling @@ -2289,7 +2325,8 @@ \section{Question 5: Uncertainty}\label{question-5-uncertainty-1} Given the above derivation, the standard error of the sample mean under iid sampling is \(\sigma / \sqrt{n}\). 
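A short simulation sketch can verify this formula numerically; the population mean, standard deviation, and sample size below are arbitrary choices.

\begin{verbatim}
## Check that the standard error of the sample mean under iid sampling is
## sigma / sqrt(n). All population values here are arbitrary.
set.seed(123)
sigma <- 3
n     <- 50

## Many samples of size n; record the sample mean of each
xbars <- replicate(10000, mean(rnorm(n, mean = 10, sd = sigma)))

sd(xbars)         # simulated standard error of the sample mean
sigma / sqrt(n)   # theoretical standard error, about 0.424
\end{verbatim}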
-\subsection{Mean squared error}\label{mean-squared-error} +\hypertarget{mean-squared-error}{% +\subsection{Mean squared error}\label{mean-squared-error}} Bias and sampling variance measure two properties of ``good'' estimators because they capture the fact that we want the estimator to be as close @@ -2301,12 +2338,12 @@ \subsection{Mean squared error}\label{mean-squared-error} The MSE also relates to the bias and the sampling variance (provided it is finite) via the following decomposition result: -\begin{equation}\phantomsection\label{eq-mse-decomposition}{ +\begin{equation}\protect\hypertarget{eq-mse-decomposition}{}{ \text{MSE} = \text{bias}[\widehat{\theta}_n]^2 + \V[\widehat{\theta}_n] -}\end{equation} This decomposition implies that, for unbiased -estimators, MSE is the sampling variance. It also highlights why we -might accept some bias for significant reductions in variance for lower -overall MSE. +}\label{eq-mse-decomposition}\end{equation} This decomposition implies +that, for unbiased estimators, MSE is the sampling variance. It also +highlights why we might accept some bias for significant reductions in +variance for lower overall MSE. \begin{figure}[th] @@ -2316,7 +2353,7 @@ \subsection{Mean squared error}\label{mean-squared-error} \caption{Two sampling distributions} -\end{figure}% +\end{figure} This figure shows the sampling distributions of two estimators: (1) \(\widehat{\theta}_a\), which is unbiased (centered on the true value @@ -2328,7 +2365,8 @@ \subsection{Mean squared error}\label{mean-squared-error} bias and variance, and, indeed, in this case, \(MSE[\widehat{\theta}_b] < MSE[\widehat{\theta}_a]\). -\section{Summary}\label{summary-1} +\hypertarget{summary-1}{% +\section{Summary}\label{summary-1}} In this chapter, we introduced \textbf{model-based inference}, in which we posit a probability model for the data-generating process. These @@ -2361,9 +2399,11 @@ \section{Summary}\label{summary-1} no matter the sample size). In the next chapter, we will derive even more powerful results using large-sample approximations. -\chapter{Asymptotics}\label{sec-asymptotics} +\hypertarget{sec-asymptotics}{% +\chapter{Asymptotics}\label{sec-asymptotics}} -\section{Introduction}\label{introduction-2} +\hypertarget{introduction-2}{% +\section{Introduction}\label{introduction-2}} Suppose we are still interested in estimating the proportion of citizens who prefer increasing legal immigration. Based on the last chapter, a @@ -2404,8 +2444,9 @@ \section{Introduction}\label{introduction-2} chapter to estimate standard errors, construct confidence intervals, and perform hypothesis tests, all without assuming a fully parametric model. +\hypertarget{convergence-of-deterministic-sequences}{% \section{Convergence of deterministic -sequences}\label{convergence-of-deterministic-sequences} +sequences}\label{convergence-of-deterministic-sequences}} A helpful place to begin is by reviewing the basic idea of convergence in deterministic sequences from calculus: @@ -2437,7 +2478,13 @@ \section{Convergence of deterministic \(n_{\epsilon} = 4\) would satisfy this condition since \(1/4 \leq 0.3\). -\includegraphics{asymptotics_files/figure-pdf/sequence-1.pdf} +\begin{figure}[th] + +{\centering \includegraphics{asymptotics_files/figure-pdf/sequence-1.pdf} + +} + +\end{figure} More generally, for any \(\epsilon\), \(n \geq 1/\epsilon\) implies \(1/n \leq \epsilon\). 
Thus, setting \(n_{\epsilon} = 1/\epsilon\) @@ -2492,8 +2539,9 @@ \section{Convergence of deterministic notice that \(\P(X_n = 0) = 0\) because of the nature of continuous random variables. +\hypertarget{convergence-in-probability-and-consistency}{% \section{Convergence in probability and -consistency}\label{convergence-in-probability-and-consistency} +consistency}\label{convergence-in-probability-and-consistency}} A sequence of random variables can converge in several different ways. The first type of convergence deals with sequences converging to a @@ -2548,7 +2596,7 @@ \section{Convergence in probability and \end{example} -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Notation alert}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Notation alert}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Sometimes convergence in probability is written as \(\text{plim}(Z_n) = b\) when \(Z_n \inprob b\), \(\text{plim}\) stands @@ -2576,7 +2624,7 @@ \section{Convergence in probability and will approach 0. Generally speaking, consistency is a very desirable property of an estimator. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Estimators can be inconsistent yet still converge in probability to an understandable quantity. For example, we will discuss in later chapters @@ -2597,7 +2645,8 @@ \section{Convergence in probability and the vector converges to the corresponding element in \(\mb{b}\), or that \(X_{nj} \inprob b_j\) for all \(j = 1, \ldots, k\). -\section{Useful inequalities}\label{useful-inequalities} +\hypertarget{useful-inequalities}{% +\section{Useful inequalities}\label{useful-inequalities}} At first glance, establishing an estimator's consistency will be difficult. How can we know if a distribution will collapse to a specific @@ -2618,6 +2667,7 @@ \section{Useful inequalities}\label{useful-inequalities} \end{theorem} \begin{proof} + Note that we can let \(Y = |X|/\delta\) and rewrite the statement as \(\P(Y \geq 1) \leq \E[Y]\) (since \(\E[|X|]/\delta = \E[|X|/\delta]\) by the properties of expectation), which is what we will show. 
But also @@ -2631,6 +2681,7 @@ \section{Useful inequalities}\label{useful-inequalities} \(Y \geq 1\), so the inequality holds. If we take the expectation of both sides of this inequality, we obtain the result (remember, the expectation of an indicator function is the probability of the event). + \end{proof} In words, Markov's inequality says that the probability of a random @@ -2663,10 +2714,12 @@ \section{Useful inequalities}\label{useful-inequalities} \end{theorem} \begin{proof} + To prove this, we only need to square both sides of the inequality inside the probability statement and apply Markov's inequality: \[ \P\left( |X - \E[X]| \geq \delta \right) = \P((X-\E[X])^2 \geq \delta^2) \leq \frac{\E[(X - \E[X])^2]}{\delta^2} = \frac{\V[X]}{\delta^2}, \] with the last equality holding by the definition of variance. + \end{proof} Chebyshev's inequality is a straightforward extension of the Markov @@ -2682,7 +2735,8 @@ \section{Useful inequalities}\label{useful-inequalities} about 5\% of draws will be greater than 2 SDs away from the mean, much lower than the 25\% bound implied by Chebyshev's inequality. -\section{The law of large numbers}\label{the-law-of-large-numbers} +\hypertarget{the-law-of-large-numbers}{% +\section{The law of large numbers}\label{the-law-of-large-numbers}} We can now use these inequalities to show how estimators can be consistent for their target quantities of interest without making @@ -2704,6 +2758,7 @@ \section{The law of large numbers}\label{the-law-of-large-numbers} \end{theorem} \begin{proof} + Recall that the sample mean is unbiased, so \(\E[\Xbar_n] = \mu\) with sampling variance \(\sigma^2/n\). We can then apply Chebyshev to the sample mean to get \[ @@ -2711,6 +2766,7 @@ \section{The law of large numbers}\label{the-law-of-large-numbers} \] An \(n\rightarrow\infty\), the right-hand side goes to 0, which means that the left-hand side also must go to 0, which is the definition of \(\Xbar_n\) converging in probability to \(\mu\). + \end{proof} The weak law of large numbers (WLLN) shows that, under general @@ -2718,7 +2774,7 @@ \section{The law of large numbers}\label{the-law-of-large-numbers} \(n\rightarrow\infty\). This result holds even when the variance of the data is infinite, though researchers will rarely face such a situation. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] The naming of the ``weak'' law of large numbers seems to imply the existence of a ``strong'' law of large numbers (SLLN), which is true. 
@@ -2748,16 +2804,14 @@ \section{The law of large numbers}\label{the-law-of-large-numbers} \begin{figure}[th] -\centering{ - -\includegraphics{asymptotics_files/figure-pdf/fig-lln-sim-1.pdf} +{\centering \includegraphics{asymptotics_files/figure-pdf/fig-lln-sim-1.pdf} } \caption{\label{fig-lln-sim}Sampling distribution of the sample mean as a function of sample size.} -\end{figure}% +\end{figure} \end{example} @@ -2793,7 +2847,7 @@ \section{The law of large numbers}\label{the-law-of-large-numbers} \end{theorem} -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Notation alert}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Notation alert}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Note that many of the formal results presented so far have ``moment conditions'' that certain moments are finite. For the vector WLLN, we @@ -2811,7 +2865,8 @@ \section{The law of large numbers}\label{the-law-of-large-numbers} \end{tcolorbox} -\section{Consistency of estimators}\label{consistency-of-estimators} +\hypertarget{consistency-of-estimators}{% +\section{Consistency of estimators}\label{consistency-of-estimators}} The WLLN shows that the sample mean of iid draws is consistent for the population mean, which is a massive result given that so many estimators @@ -2954,10 +3009,12 @@ \section{Consistency of estimators}\label{consistency-of-estimators} \end{theorem} \begin{proof} + Using Markov's inequality, we have \[ \P\left( |\widehat{\theta}_n - \theta| \geq \delta \right) = \P((\widehat{\theta}_n-\theta)^2 \geq \delta^2) \leq \frac{\E[(\widehat{\theta}_n - \theta)^2]}{\delta^2} = \frac{\text{bias}[\widehat{\theta}_n]^2 + \V[\widehat{\theta}]}{\delta^2} \to 0. \] The last inequality follows from the bias-variance decomposition of the mean squared error in Equation~\ref{eq-mse-decomposition}. + \end{proof} We can use this result to show consistency for a large range of @@ -2993,8 +3050,9 @@ \section{Consistency of estimators}\label{consistency-of-estimators} \end{example} +\hypertarget{convergence-in-distribution-and-the-central-limit-theorem}{% \section{Convergence in distribution and the central limit -theorem}\label{convergence-in-distribution-and-the-central-limit-theorem} +theorem}\label{convergence-in-distribution-and-the-central-limit-theorem}} Convergence in probability and the law of large numbers are beneficial for understanding how our estimators will (or will not) collapse to @@ -3041,7 +3099,13 @@ \section{Convergence in distribution and the central limit \] By inspection, this converges to \(\Phi(x)\), which is the cdf for the standard normal. This implies \(X_n \indist N(0, 1)\). 
-\includegraphics{asymptotics_files/figure-pdf/indist-1.pdf} +\begin{figure}[th] + +{\centering \includegraphics{asymptotics_files/figure-pdf/indist-1.pdf} + +} + +\end{figure} \end{example} @@ -3074,7 +3138,7 @@ \section{Convergence in distribution and the central limit binary, event count, continuous, or anything. The CLT is incredibly broadly applicable. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Notation alert}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Notation alert}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Why do we state the CLT in terms of the sample mean after centering and scaling by its standard error? Suppose we don't normalize the sample @@ -3128,16 +3192,14 @@ \section{Convergence in distribution and the central limit \begin{figure}[th] -\centering{ - -\includegraphics{asymptotics_files/figure-pdf/fig-clt-1.pdf} +{\centering \includegraphics{asymptotics_files/figure-pdf/fig-clt-1.pdf} } \caption{\label{fig-clt}Sampling distributions of the normalized sample mean at n=30 and n=100.} -\end{figure}% +\end{figure} \end{example} @@ -3204,7 +3266,7 @@ \section{Convergence in distribution and the central limit random variables in \(\X_i\) and \(\mb{\Sigma}\) is the variance-covariance matrix for that vector. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] As with the notation alert with the WLLN, we are using shorthand here, \(\E\Vert \mb{X}_i \Vert^2 < \infty\), which implies that @@ -3214,7 +3276,8 @@ \section{Convergence in distribution and the central limit \end{tcolorbox} -\section{Confidence intervals}\label{confidence-intervals} +\hypertarget{confidence-intervals}{% +\section{Confidence intervals}\label{confidence-intervals}} We now turn to an essential application of the central limit theorem: confidence intervals. @@ -3271,7 +3334,7 @@ \section{Confidence intervals}\label{confidence-intervals} constructed in the exact same way, we should expect 95 of those confidence intervals to contain the true value. 
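A brief simulation sketch makes this repeated-sampling idea concrete: construct the usual 95\% interval (derived just below) in many simulated samples and record how often it covers the true mean. The data-generating values here are arbitrary.

\begin{verbatim}
## Coverage of the standard 95% confidence interval for a mean,
## Xbar +/- 1.96 * sigma_hat / sqrt(n), across repeated samples.
## The true mean and the rest of the DGP are arbitrary choices.
set.seed(2024)
mu <- 1
n  <- 100

covers <- replicate(10000, {
  x     <- rnorm(n, mean = mu, sd = 2)
  se    <- sd(x) / sqrt(n)
  lower <- mean(x) - 1.96 * se
  upper <- mean(x) + 1.96 * se
  lower <= mu & mu <= upper        # did this interval contain the truth?
})

mean(covers)   # should be close to 0.95
\end{verbatim}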
-\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, breakable, colbacktitle=quarto-callout-warning-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-warning-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-warning-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-warning-color-frame] Suppose we have a 95\% confidence interval, \([0.1, 0.4]\). It would be tempting to make a probability statement like @@ -3303,8 +3366,9 @@ \section{Confidence intervals}\label{confidence-intervals} usually rely on large-sample approximations based on the central limit theorem. +\hypertarget{deriving-confidence-intervals}{% \subsection{Deriving confidence -intervals}\label{deriving-confidence-intervals} +intervals}\label{deriving-confidence-intervals}} To derive confidence intervals, consider the standard formula for the 95\% confidence interval of the sample mean, \[ @@ -3345,15 +3409,13 @@ \subsection{Deriving confidence \begin{figure}[th] -\centering{ - -\includegraphics{asymptotics_files/figure-pdf/fig-std-normal-1.pdf} +{\centering \includegraphics{asymptotics_files/figure-pdf/fig-std-normal-1.pdf} } \caption{\label{fig-std-normal}Critical values for the standard normal.} -\end{figure}% +\end{figure} How can we generalize this to \(1-\alpha\) confidence intervals? For a random variable that is distributed following a standard normal, \(Z\), @@ -3383,8 +3445,9 @@ \subsection{Deriving confidence \left[\Xbar_{n} - 1.64 \frac{\widehat{\sigma}}{\sqrt{n}}, \Xbar_{n} + 1.64 \frac{\widehat{\sigma}}{\sqrt{n}}\right] \] +\hypertarget{interpreting-confidence-intervals}{% \subsection{Interpreting confidence -intervals}\label{interpreting-confidence-intervals} +intervals}\label{interpreting-confidence-intervals}} A very important point is that the interpretation of confidence is how the random interval performs over repeated samples. A valid 95\% @@ -3422,9 +3485,7 @@ \subsection{Interpreting confidence \begin{figure}[th] -\centering{ - -\includegraphics{asymptotics_files/figure-pdf/fig-ci-sim-1.pdf} +{\centering \includegraphics{asymptotics_files/figure-pdf/fig-ci-sim-1.pdf} } @@ -3432,11 +3493,12 @@ \subsection{Interpreting confidence samples. 
Intervals are blue if they contain the truth and red if they do not.} -\end{figure}% +\end{figure} \end{example} -\section{Delta method}\label{sec-delta-method} +\hypertarget{sec-delta-method}{% +\section{Delta method}\label{sec-delta-method}} Suppose that we know that an estimator follows the CLT, and so we have \[ @@ -3482,15 +3544,13 @@ \section{Delta method}\label{sec-delta-method} \begin{figure}[th] -\centering{ - -\includegraphics{asymptotics_files/figure-pdf/fig-delta-1.pdf} +{\centering \includegraphics{asymptotics_files/figure-pdf/fig-delta-1.pdf} } \caption{\label{fig-delta}Linear approximation to nonlinear functions.} -\end{figure}% +\end{figure} \begin{example}[]\protect\hypertarget{exm-log}{}\label{exm-log} @@ -3568,7 +3628,8 @@ \section{Delta method}\label{sec-delta-method} method, but it is more easily adaptable across different estimators and domains. -\section{Summary}\label{summary-2} +\hypertarget{summary-2}{% +\section{Summary}\label{summary-2}} In this chapter, we covered asymptotic analysis, which considers how estimators behave as we feed them larger and larger samples. While we @@ -3592,7 +3653,8 @@ \section{Summary}\label{summary-2} introduce another important tool for statistical inference: the hypothesis test. -\chapter{Hypothesis tests}\label{sec-hypothesis-tests} +\hypertarget{sec-hypothesis-tests}{% +\chapter{Hypothesis tests}\label{sec-hypothesis-tests}} We have up to now discussed the properties of estimators that allow us to characterize their distributions in finite and large samples. These @@ -3607,7 +3669,8 @@ \chapter{Hypothesis tests}\label{sec-hypothesis-tests} One of the most ubiquitous in the social sciences is the hypothesis test, a kind of statistical thought experiment. -\section{The Lady Tasting Tea}\label{the-lady-tasting-tea} +\hypertarget{the-lady-tasting-tea}{% +\section{The Lady Tasting Tea}\label{the-lady-tasting-tea}} The story of the Lady Tasting Tea exemplifies the core ideas behind hypothesis testing.\footnote{The analysis here largely comes from Senn @@ -3653,7 +3716,7 @@ \section{The Lady Tasting Tea}\label{the-lady-tasting-tea} Thus, hypothesis tests help us assess evidence for particular guesses about the DGP. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Notation alert}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Notation alert}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] For the rest of this chapter, we will introduce the concepts following the notation in the past chapters. We will assume a random (iid) sample @@ -3665,7 +3728,8 @@ \section{The Lady Tasting Tea}\label{the-lady-tasting-tea} \end{tcolorbox} -\section{Hypotheses}\label{hypotheses} +\hypertarget{hypotheses}{% +\section{Hypotheses}\label{hypotheses}} In the context of hypothesis testing, hypotheses are simply statements about the population distribution. 
In particular, we will make @@ -3744,8 +3808,9 @@ \section{Hypotheses}\label{hypotheses} should be used with extreme caution. That said, unfortunately, the math of two-sided tests is also more complicated. +\hypertarget{the-procedure-of-hypothesis-testing}{% \section{The procedure of hypothesis -testing}\label{the-procedure-of-hypothesis-testing} +testing}\label{the-procedure-of-hypothesis-testing}} At the most basic level, a \textbf{hypothesis test} is a rule that specifies values of the sample data for which we will decide to @@ -3819,7 +3884,8 @@ \section{The procedure of hypothesis since we want to count deviations from either side of the null hypothesis as evidence against that null. -\section{Testing errors}\label{testing-errors} +\hypertarget{testing-errors}{% +\section{Testing errors}\label{testing-errors}} Hypothesis tests end with a decision to reject the null hypothesis or not, but this might be an incorrect decision. In particular, there are @@ -3845,8 +3911,9 @@ \section{Testing errors}\label{testing-errors} but we concluded there wasn't (we failed to reject the null hypothesis of no difference). +\hypertarget{tbl-errors}{} \begin{longtable}[]{@{}lll@{}} -\caption{Typology of testing errors}\label{tbl-errors}\tabularnewline +\caption{\label{tbl-errors}Typology of testing errors}\tabularnewline \toprule\noalign{} & \(H_0\) True & \(H_0\) False \\ \midrule\noalign{} @@ -3918,16 +3985,14 @@ \section{Testing errors}\label{testing-errors} \begin{figure}[th] -\centering{ - -\includegraphics{hypothesis_tests_files/figure-pdf/fig-size-power-1.pdf} +{\centering \includegraphics{hypothesis_tests_files/figure-pdf/fig-size-power-1.pdf} } \caption{\label{fig-size-power}Size of a test and power against an alternative.} -\end{figure}% +\end{figure} Figure~\ref{fig-size-power} also hints at a tradeoff between size and power. Notice that we could make the size smaller (lower the false @@ -3940,8 +4005,9 @@ \section{Testing errors}\label{testing-errors} \(\P(T > c' \mid \theta_1) < \P(T > c \mid \theta_1)\). This means we usually cannot simultaneously reduce both types of errors. +\hypertarget{determining-the-rejection-region}{% \section{Determining the rejection -region}\label{determining-the-rejection-region} +region}\label{determining-the-rejection-region}} If we cannot simultaneously optimize a test's size and power, how should we determine where the rejection region is? That is, how should we @@ -4004,18 +4070,17 @@ \section{Determining the rejection \begin{figure}[th] -\centering{ - -\includegraphics{hypothesis_tests_files/figure-pdf/fig-two-sided-1.pdf} +{\centering \includegraphics{hypothesis_tests_files/figure-pdf/fig-two-sided-1.pdf} } \caption{\label{fig-two-sided}Rejection regions for a two-sided test.} -\end{figure}% +\end{figure} +\hypertarget{hypothesis-tests-of-the-sample-mean}{% \section{Hypothesis tests of the sample -mean}\label{hypothesis-tests-of-the-sample-mean} +mean}\label{hypothesis-tests-of-the-sample-mean}} Consider the following extended example about hypothesis testing of a sample mean, sometimes called a \textbf{one-sample test} since we are @@ -4039,7 +4104,7 @@ \section{Hypothesis tests of the sample \(T \indist \N(0, 1)\). Thus, we can approximate the null distribution with the standard normal. 
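As a rough numerical sketch of this one-sample test (the data are simulated and the null value is arbitrary), we can compute the test statistic and compare it to the standard normal critical value for a two-sided test at \(\alpha = 0.05\).

\begin{verbatim}
## One-sample test of H0: mu = mu0 using the large-sample normal
## approximation. The sample and the null value are invented for illustration.
set.seed(99)
x   <- rnorm(200, mean = 0.25, sd = 1)   # simulated data
mu0 <- 0                                 # null hypothesis value

t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
t_stat

## Two-sided test with level alpha = 0.05: reject when |T| exceeds the
## standard normal critical value
crit <- qnorm(0.975)   # roughly 1.96
abs(t_stat) > crit     # TRUE means reject H0
\end{verbatim}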
-\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, breakable, colbacktitle=quarto-callout-warning-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-warning-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-warning-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-warning-color-frame] The names of the various tests can be quite confusing because they are so similar. Earlier, we discussed one-sided versus two-sided tests, @@ -4063,7 +4128,8 @@ \section{Hypothesis tests of the sample \] This means that a test where we reject when \(|T| > 1.96\) would have a level of 0.05 asymptotically. -\section{The Wald test}\label{the-wald-test} +\hypertarget{the-wald-test}{% +\section{The Wald test}\label{the-wald-test}} We can generalize the hypothesis test for the sample mean to estimators more broadly. Let \(\widehat{\theta}_n\) be an estimator for some @@ -4083,7 +4149,7 @@ \section{The Wald test}\label{the-wald-test} standard normal. That is, if \(Z \sim \N(0, 1)\), then \(z_{\alpha/2}\) satisfies \(\P(Z \geq z_{\alpha/2}) = \alpha/2\). -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] In R, you can find the \(z_{\alpha/2}\) values easily with the \texttt{qnorm()} function: @@ -4194,7 +4260,8 @@ \section{The Wald test}\label{the-wald-test} \end{example} -\section{p-values}\label{p-values} +\hypertarget{p-values}{% +\section{p-values}\label{p-values}} The hypothesis testing framework focuses on making a decision -- to reject the null hypothesis or not -- in the face of uncertainty. You @@ -4250,7 +4317,7 @@ \section{p-values}\label{p-values} a transformation of the test statistic onto a common scale between 0 and 1. 
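For the two-sided test of the sample mean above, this transformation is simply the tail probability of the standard normal null distribution beyond the observed statistic, \(2(1 - \Phi(|T|))\). A tiny sketch with made-up test statistics:

\begin{verbatim}
## Two-sided p-values under the standard normal null distribution:
## p = 2 * (1 - pnorm(|t_obs|)). The observed statistics are made up.
t_obs <- 2.1
2 * (1 - pnorm(abs(t_obs)))   # about 0.036: reject at the 0.05 level

t_obs <- 1.5
2 * (1 - pnorm(abs(t_obs)))   # about 0.134: fail to reject at the 0.05 level
\end{verbatim}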
-\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, breakable, colbacktitle=quarto-callout-warning-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-warning-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-warning-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-warning-color-frame] People use many statistical shibboleths to purportedly identify people who don't understand statistics, and these criticisms sometimes hinge on @@ -4268,7 +4335,8 @@ \section{p-values}\label{p-values} \end{tcolorbox} -\section{Power analysis}\label{power-analysis} +\hypertarget{power-analysis}{% +\section{Power analysis}\label{power-analysis}} Imagine you have spent a large amount of your research budget on a big experiment that tests a new and exciting theory, but the results come @@ -4308,8 +4376,9 @@ \section{Power analysis}\label{power-analysis} \end{theorem} +\hypertarget{exact-tests-under-normal-data}{% \section{Exact tests under normal -data}\label{exact-tests-under-normal-data} +data}\label{exact-tests-under-normal-data}} The Wald test above relies on large-sample approximations but these may not be valid in finate samples. Can we get \textbf{exact} inferences at @@ -4335,18 +4404,17 @@ \section{Exact tests under normal \begin{figure}[th] -\centering{ - -\includegraphics{hypothesis_tests_files/figure-pdf/fig-shape-of-t-1.pdf} +{\centering \includegraphics{hypothesis_tests_files/figure-pdf/fig-shape-of-t-1.pdf} } \caption{\label{fig-shape-of-t}Normal versus t distribution.} -\end{figure}% +\end{figure} +\hypertarget{confidence-intervals-and-hypothesis-tests}{% \section{Confidence intervals and hypothesis -tests}\label{confidence-intervals-and-hypothesis-tests} +tests}\label{confidence-intervals-and-hypothesis-tests}} At first glance, we may seem sloppy in using \(\alpha\) in deriving a \(1 - \alpha\) confidence interval in the last chapter and an @@ -4401,7 +4469,8 @@ \section{Confidence intervals and hypothesis experiments. \end{enumerate} -\section{Summary}\label{summary-3} +\hypertarget{summary-3}{% +\section{Summary}\label{summary-3}} In this chapter, we covered the basics of hypothesis tests, which are a type of statistical thought experiment. We assume that we know the true @@ -4426,7 +4495,8 @@ \section{Summary}\label{summary-3} \part{Regression} -\chapter{Linear regression}\label{sec-regression} +\hypertarget{sec-regression}{% +\chapter{Linear regression}\label{sec-regression}} Regression is simply a set of tools for evaluating the relationship between an \textbf{outcome variable}, \(Y_i\), and a set of @@ -4469,7 +4539,8 @@ \chapter{Linear regression}\label{sec-regression} the regressors, inputs, or features \end{itemize} -\section{Why do we need models?}\label{why-do-we-need-models} +\hypertarget{why-do-we-need-models}{% +\section{Why do we need models?}\label{why-do-we-need-models}} At first glance, the connection between the CEF and parametric models might be hazy. 
For example, imagine we are interested in estimating the @@ -4532,9 +4603,7 @@ \section{Why do we need models?}\label{why-do-we-need-models} \begin{figure}[th] -\centering{ - -\includegraphics{linear_model_files/figure-pdf/fig-cef-binned-1.pdf} +{\centering \includegraphics{linear_model_files/figure-pdf/fig-cef-binned-1.pdf} } @@ -4542,7 +4611,7 @@ \section{Why do we need models?}\label{why-do-we-need-models} and poll wait times (contour plot), conditional expectation function (red), and the conditional expectation of the binned income (blue).} -\end{figure}% +\end{figure} Similarly, we could \textbf{assume} that the CEF follows a simple functional form such as a line: \[ @@ -4567,10 +4636,12 @@ \section{Why do we need models?}\label{why-do-we-need-models} assumptions and then investigating how well this estimand approximates the true CEF. -\section{Population linear regression}\label{sec-linear-projection} +\hypertarget{sec-linear-projection}{% +\section{Population linear regression}\label{sec-linear-projection}} +\hypertarget{bivariate-linear-regression}{% \subsection{Bivariate linear -regression}\label{bivariate-linear-regression} +regression}\label{bivariate-linear-regression}} Let's set aside the idea of the conditional expectation function and instead focus on finding the \textbf{linear} function of a single @@ -4642,19 +4713,18 @@ \subsection{Bivariate linear \begin{figure}[th] -\centering{ - -\includegraphics{linear_model_files/figure-pdf/fig-cef-blp-1.pdf} +{\centering \includegraphics{linear_model_files/figure-pdf/fig-cef-blp-1.pdf} } \caption{\label{fig-cef-blp}Comparison of the CEF and the best linear predictor.} -\end{figure}% +\end{figure} +\hypertarget{beyond-linear-approximations}{% \subsection{Beyond linear -approximations}\label{beyond-linear-approximations} +approximations}\label{beyond-linear-approximations}} The linear part of the ``best linear predictor'' is less restrictive than it appears at first glance. We can easily modify the minimum MSE @@ -4673,8 +4743,9 @@ \subsection{Beyond linear usually pay for this flexibility with overfitting and high variance in our estimates. +\hypertarget{linear-prediction-with-multiple-covariates}{% \subsection{Linear prediction with multiple -covariates}\label{linear-prediction-with-multiple-covariates} +covariates}\label{linear-prediction-with-multiple-covariates}} We now generalize the idea of a best linear predictor to a setting with an arbitrary number of covariates, which more flexibly captures @@ -4692,7 +4763,7 @@ \subsection{Linear prediction with multiple expected mean-squared error, where the expectation is over the joint distribution of the data. 
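To make this concrete, the following sketch simulates a deliberately nonlinear conditional expectation (the data-generating process is an arbitrary invention) and approximates the population minimizer with its sample analog in a very large sample, using the moment formula given just below.

\begin{verbatim}
## The best linear predictor minimizes E[(Y - X'b)^2] over the joint
## distribution of (Y, X). With a very large simulated sample, the
## sample analog of E[XX']^{-1} E[XY] approximates that population
## minimizer, even though the true CEF below is nonlinear in x1.
set.seed(321)
n  <- 1e6
x1 <- runif(n, 0, 10)
x2 <- rbinom(n, 1, 0.5)
y  <- exp(x1 / 5) + 2 * x2 + rnorm(n)   # nonlinear CEF (arbitrary choice)

X <- cbind(1, x1, x2)                   # include a constant
solve(crossprod(X), crossprod(X, y))    # approximate BLP coefficients
\end{verbatim}

The resulting coefficients define the best linear approximation to this nonlinear CEF, not the CEF itself.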
-\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Best linear projection assumptions}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Best linear projection assumptions}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Without some assumptions on the joint distribution of the data, the following ``regularity conditions'' will ensure the existence of the @@ -4746,7 +4817,7 @@ \subsection{Linear prediction with multiple \(k\times 1\) column vector, which implies that \(\bfbeta\) is also a \(k \times 1\) column vector. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] What does the expression for the population regression coefficients mean? It is helpful to separate the intercept or constant term so that @@ -4772,7 +4843,8 @@ \subsection{Linear prediction with multiple m(\X_{i}) = \X_{i}'\left(\E[\X_{i}\X_{i}']\right)^{-1}\E[\X_{i}Y_{i}] = \X_{i}'\mb{Q}_{\mb{XX}}^{-1}\mb{Q}_{\mb{X}Y} \] -\subsection{Projection error}\label{projection-error} +\hypertarget{projection-error}{% +\subsection{Projection error}\label{projection-error}} The \textbf{projection error} or is the difference between the actual value of \(Y_i\) and the projection, \[ @@ -4844,8 +4916,9 @@ \subsection{Projection error}\label{projection-error} still targets a perfectly valid quantity of interest: the coefficients from this population linear projection. +\hypertarget{linear-cefs-without-assumptions}{% \section{Linear CEFs without -assumptions}\label{linear-cefs-without-assumptions} +assumptions}\label{linear-cefs-without-assumptions}} What is the relationship between the best linear predictor (which we just saw generally exists) and the CEF? To draw the connection, remember @@ -5033,8 +5106,9 @@ \section{Linear CEFs without linear function without assumptions. The above examples show how to construct saturated models in various situations. 
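A small simulated example (the population below is arbitrary) illustrates the saturated-model logic with two binary covariates: once both dummies and their interaction are included, the fitted regression reproduces the subgroup means exactly, mirroring the fact that the CEF is exactly linear in these terms.

\begin{verbatim}
## Saturated model with two binary covariates: the regression with both
## dummies and their interaction reproduces the four cell means exactly.
## The simulated population is arbitrary.
set.seed(2718)
n  <- 10000
x1 <- rbinom(n, 1, 0.5)
x2 <- rbinom(n, 1, 0.4)
y  <- 1 + 0.5 * x1 - 2 * x2 + 3 * x1 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 * x2, data = dat)   # saturated specification

## Raw subgroup means (rows index x1, columns index x2)
with(dat, tapply(y, list(x1 = x1, x2 = x2), mean))

## Fitted values at the four cells, in the order
## (x1, x2) = (0,0), (1,0), (0,1), (1,1); they match the means above
predict(fit, newdata = expand.grid(x1 = 0:1, x2 = 0:1))
\end{verbatim}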
+\hypertarget{interpretation-of-the-regression-coefficients}{% \section{Interpretation of the regression -coefficients}\label{interpretation-of-the-regression-coefficients} +coefficients}\label{interpretation-of-the-regression-coefficients}} We have seen how to interpret population regression coefficients when the CEF is linear without assumptions. How do we interpret the @@ -5062,8 +5136,9 @@ \section{Interpretation of the regression ``all else equal'' difference in the predicted outcome for each covariate. +\hypertarget{polynomial-functions-of-the-covariates}{% \subsection{Polynomial functions of the -covariates}\label{polynomial-functions-of-the-covariates} +covariates}\label{polynomial-functions-of-the-covariates}} The interpretation of the population regression coefficients becomes more complicated when including nonlinear functions of the covariates. @@ -5096,7 +5171,8 @@ \subsection{Polynomial functions of the function (perhaps using the orthogonalization techniques in Section~\ref{sec-fwl}). -\subsection{Interactions}\label{interactions} +\hypertarget{interactions}{% +\subsection{Interactions}\label{interactions}} Another common nonlinear function occurs when including \textbf{interaction terms} or covariates that are products of two other @@ -5136,7 +5212,7 @@ \subsection{Interactions}\label{interactions} \(\beta_3\) represents the change in the slope of the wait time-income relationship between Black and non-Black voters. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Centering variables to improve interpretability}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Centering variables to improve interpretability}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] In many cases, the so-called marginal coefficients on the lower-order terms (\(\beta_1\) for \(X_{i1}\) and \(\beta_2\) for \(X_{i2}\)) are @@ -5160,16 +5236,18 @@ \subsection{Interactions}\label{interactions} \end{tcolorbox} -\section{Multiple regression from bivariate regression}\label{sec-fwl} +\hypertarget{sec-fwl}{% +\section{Multiple regression from bivariate regression}\label{sec-fwl}} With a regression of an outcome on two covariates, understanding how the coefficients of one variable relate to the other is helpful. Consider the following best linear projection: -\begin{equation}\phantomsection\label{eq-two-var-blp}{ +\begin{equation}\protect\hypertarget{eq-two-var-blp}{}{ (\alpha, \beta, \gamma) = \argmin_{(a,b,c) \in \mathbb{R}^{3}} \; \E[(Y_{i} - (a + bX_{i} + cZ_{i}))^{2}] -}\end{equation} Can we understand the \(\beta\) coefficient here in -terms of a bivariate regression? As it turns out, yes. From the above -results, we know that the intercept has a simple form: \[ +}\label{eq-two-var-blp}\end{equation} Can we understand the \(\beta\) +coefficient here in terms of a bivariate regression? As it turns out, +yes. 
From the above results, we know that the intercept has a simple +form: \[ \alpha = \E[Y_i] - \beta\E[X_i] - \gamma\E[Z_i]. \] Let's investigate the first order condition for \(\beta\): \[ \begin{aligned} @@ -5213,7 +5291,8 @@ \section{Multiple regression from bivariate regression}\label{sec-fwl} the relationship between the outcome and the covariate after removing the linear relationships of all other variables. -\section{Omitted variable bias}\label{omitted-variable-bias} +\hypertarget{omitted-variable-bias}{% +\section{Omitted variable bias}\label{omitted-variable-bias}} In many situations, we may need to choose whether to include a variable in a regression, so it can be helpful to understand how this choice @@ -5257,7 +5336,8 @@ \section{Omitted variable bias}\label{omitted-variable-bias} them here as features of a particular population quantity, the linear projection or population linear regression. -\section{Drawbacks of the BLP}\label{drawbacks-of-the-blp} +\hypertarget{drawbacks-of-the-blp}{% +\section{Drawbacks of the BLP}\label{drawbacks-of-the-blp}} The best linear predictor is, of course, a \emph{linear} approximation to the CEF, and this approximation could be quite poor if the true CEF @@ -5272,18 +5352,17 @@ \section{Drawbacks of the BLP}\label{drawbacks-of-the-blp} \begin{figure}[th] -\centering{ - -\includegraphics{linear_model_files/figure-pdf/fig-blp-limits-1.pdf} +{\centering \includegraphics{linear_model_files/figure-pdf/fig-blp-limits-1.pdf} } \caption{\label{fig-blp-limits}Linear projections for when truncating income distribution below \$50k and above \$100k.} -\end{figure}% +\end{figure} -\section{Summary}\label{summary-4} +\hypertarget{summary-4}{% +\section{Summary}\label{summary-4}} As we discussed in this chapter, with even a moderate number of covariates, conditional expectation functions (also known as @@ -5305,7 +5384,8 @@ \section{Summary}\label{summary-4} independent variables. In the next chapter, we will turn to using data to estimate the coefficients for these population linear regressions. -\chapter{The mechanics of least squares}\label{sec-ols-mechanics} +\hypertarget{sec-ols-mechanics}{% +\chapter{The mechanics of least squares}\label{sec-ols-mechanics}} This chapter explores the most widely used estimator for population linear regressions: \textbf{ordinary least squares} (OLS). OLS is a @@ -5326,9 +5406,7 @@ \chapter{The mechanics of least squares}\label{sec-ols-mechanics} \begin{figure}[th] -\centering{ - -\includegraphics{least_squares_files/figure-pdf/fig-ajr-scatter-1.pdf} +{\centering \includegraphics{least_squares_files/figure-pdf/fig-ajr-scatter-1.pdf} } @@ -5336,9 +5414,10 @@ \chapter{The mechanics of least squares}\label{sec-ols-mechanics} institutions and economic development from Acemoglu, Johnson, and Robinson (2001).} -\end{figure}% +\end{figure} -\section{Deriving the OLS estimator}\label{deriving-the-ols-estimator} +\hypertarget{deriving-the-ols-estimator}{% +\section{Deriving the OLS estimator}\label{deriving-the-ols-estimator}} The last chapter on the linear model and the best linear projection operated purely in the population, not samples. We derived the @@ -5349,7 +5428,7 @@ \section{Deriving the OLS estimator}\label{deriving-the-ols-estimator} population and the population coefficients. To do this, we will focus on the OLS estimator for these population quantities. 
-\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Assumption}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Assumption}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] The variables \(\{(Y_1, \X_1), \ldots, (Y_i,\X_i), \ldots, (Y_n, \X_n)\}\) are i.i.d. @@ -5428,7 +5507,7 @@ \section{Deriving the OLS estimator}\label{deriving-the-ols-estimator} \end{theorem} -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Formula for the OLS slopes}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Formula for the OLS slopes}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Almost all regression will contain an intercept term, usually represented as a constant 1 in the covariate vector. It is also possible @@ -5468,18 +5547,17 @@ \section{Deriving the OLS estimator}\label{deriving-the-ols-estimator} \begin{figure}[th] -\centering{ - -\includegraphics{least_squares_files/figure-pdf/fig-ssr-comp-1.pdf} +{\centering \includegraphics{least_squares_files/figure-pdf/fig-ssr-comp-1.pdf} } \caption{\label{fig-ssr-comp}Different possible lines and their corresponding sum of squared residuals.} -\end{figure}% +\end{figure} -\section{Model fit}\label{model-fit} +\hypertarget{model-fit}{% +\section{Model fit}\label{model-fit}} We have learned how to use OLS to obtain an estimate of the best linear predictor, but an open question is whether that prediction is any good. @@ -5499,16 +5577,14 @@ \section{Model fit}\label{model-fit} \begin{figure}[th] -\centering{ - -\includegraphics{least_squares_files/figure-pdf/fig-ssr-vs-tss-1.pdf} +{\centering \includegraphics{least_squares_files/figure-pdf/fig-ssr-vs-tss-1.pdf} } \caption{\label{fig-ssr-vs-tss}Total sum of squares vs.~the sum of squared residuals.} -\end{figure}% +\end{figure} We can use the \textbf{proportion reduction in prediction error} from adding those covariates to measure how much those covariates improve the @@ -5536,7 +5612,8 @@ \section{Model fit}\label{model-fit} fit, which occurs when all data points are perfectly predicted by the model with zero residuals. 
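As a quick check of this definition, the following R sketch computes
\(R^2\) by hand from the residuals of a fitted line and compares it with the
value reported by \texttt{summary()}; the simulated data are purely
illustrative.

\begin{verbatim}
## Minimal sketch: R^2 as the proportional reduction in squared prediction
## error, checked against summary.lm() (simulated, hypothetical data).
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)
ssr <- sum(resid(fit)^2)          # sum of squared residuals
tss <- sum((y - mean(y))^2)       # total sum of squares around the mean

r2_manual <- 1 - ssr / tss
all.equal(r2_manual, summary(fit)$r.squared)
\end{verbatim}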
-\section{Matrix form of OLS}\label{matrix-form-of-ols} +\hypertarget{matrix-form-of-ols}{% +\section{Matrix form of OLS}\label{matrix-form-of-ols}} We derived the OLS estimator above using simple algebra and calculus, but a more common representation of the estimator relies on vectors and @@ -5615,8 +5692,9 @@ \section{Matrix form of OLS}\label{matrix-form-of-ols} \mathbb{X}'\widehat{\mb{e}} = \sum_{i=1}^{n} \X_{i}\widehat{e}_{i} = 0, \] which also implies these vectors are \textbf{orthogonal}. +\hypertarget{sec-rank}{% \section{Rank, linear independence, and -multicollinearity}\label{sec-rank} +multicollinearity}\label{sec-rank}} We noted that the OLS estimator exists when \(\sum_{i=1}^n \X_i\X_i'\) is positive definite or that there is ``no multicollinearity.'' This @@ -5692,8 +5770,9 @@ \section{Rank, linear independence, and covariates as is necessary to achieve full rank. R will show the estimated coefficients as \texttt{NA} in those cases. +\hypertarget{ols-coefficients-for-binary-and-categorical-regressors}{% \section{OLS coefficients for binary and categorical -regressors}\label{ols-coefficients-for-binary-and-categorical-regressors} +regressors}\label{ols-coefficients-for-binary-and-categorical-regressors}} Suppose that the covariates include just the intercept and a single binary variable, \(\X_i = (1\; X_{i})'\), where \(X_i \in \{0,1\}\). In @@ -5719,8 +5798,9 @@ \section{OLS coefficients for binary and categorical These exact relationships fail when other covariates are added to the model. +\hypertarget{projection-and-geometry-of-least-squares}{% \section{Projection and geometry of least -squares}\label{projection-and-geometry-of-least-squares} +squares}\label{projection-and-geometry-of-least-squares}} OLS has a very nice geometric interpretation that adds a lot of intuition for various aspects of the method. In this geometric approach, @@ -5754,16 +5834,14 @@ \section{Projection and geometry of least \begin{figure}[th] -\centering{ - -\includegraphics{assets/img/projection-drawing.png} +{\centering \includegraphics{assets/img/projection-drawing.png} } \caption{\label{fig-projection}Projection of Y on the column space of the covariates.} -\end{figure}% +\end{figure} This figure shows that the residual vector, which is the difference between the \(\mb{Y}\) vector and the projection \(\Xmat\bhat\), is @@ -5774,8 +5852,9 @@ \section{Projection and geometry of least \] as we established above. Being orthogonal to all the columns means it will also be orthogonal to all linear combinations of the columns. +\hypertarget{projection-and-annihilator-matrices}{% \section{Projection and annihilator -matrices}\label{projection-and-annihilator-matrices} +matrices}\label{projection-and-annihilator-matrices}} With the idea of projection to the column space of \(\Xmat\) established, we can define a way to project any vector into that space. @@ -5836,7 +5915,8 @@ \section{Projection and annihilator \mb{Y} = \Xmat\bhat + \widehat{\mb{e}} = \mb{P}_{\Xmat}\mb{Y} + \mb{M}_{\Xmat}\mb{Y}. \] -\section{Residual regression}\label{residual-regression} +\hypertarget{residual-regression}{% +\section{Residual regression}\label{residual-regression}} There are many situations where we can partition the covariates into two groups, and we might wonder if it is possible to express or calculate @@ -5853,7 +5933,7 @@ \section{Residual regression}\label{residual-regression} \textbf{partitioned regression}, or the \textbf{Frisch-Waugh-Lovell theorem}. 
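The callout below lays out the recipe step by step; as a preview, here is a
minimal R sketch of residual regression with a single covariate partialled
out. The simulated data and the variable names \texttt{x} and \texttt{z} are
purely illustrative assumptions.

\begin{verbatim}
## Minimal sketch of residual (Frisch-Waugh-Lovell) regression, previewing
## the steps in the callout below (simulated, hypothetical data).
set.seed(2)
n <- 1000
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)           # x is correlated with z
y <- 1 + 2 * x - z + rnorm(n)

b_full <- coef(lm(y ~ x + z))["x"]   # coefficient on x in the full regression

x_tilde <- resid(lm(x ~ z))          # residualize x with respect to z
y_tilde <- resid(lm(y ~ z))          # residualize y with respect to z
b_fwl   <- coef(lm(y_tilde ~ x_tilde))["x_tilde"]

all.equal(unname(b_full), unname(b_fwl))   # identical coefficients
\end{verbatim}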
-\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Residual regression approach}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Residual regression approach}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] The residual regression approach is: @@ -5913,8 +5993,9 @@ \section{Residual regression}\label{residual-regression} conditional relationship appears linear or should be modeled in another way. +\hypertarget{outliers-leverage-points-and-influential-observations}{% \section{Outliers, leverage points, and influential -observations}\label{outliers-leverage-points-and-influential-observations} +observations}\label{outliers-leverage-points-and-influential-observations}} Given that OLS finds the coefficients that minimize the sum of the squared residuals, asking how much impact each residual has on that @@ -5947,7 +6028,8 @@ \section{Outliers, leverage points, and influential We'll take each of these in turn. -\subsection{Leverage points}\label{sec-leverage} +\hypertarget{sec-leverage}{% +\subsection{Leverage points}\label{sec-leverage}} We can define the \textbf{leverage} of an observation by \[ h_{ii} = \X_{i}'\left(\Xmat'\Xmat\right)^{-1}\X_{i}, @@ -5980,8 +6062,9 @@ \subsection{Leverage points}\label{sec-leverage} \(\sum_{i=1}^{n} h_{ii} = k + 1\) \end{enumerate} +\hypertarget{outliers-and-leave-one-out-regression}{% \subsection{Outliers and leave-one-out -regression}\label{outliers-and-leave-one-out-regression} +regression}\label{outliers-and-leave-one-out-regression}} In the context of OLS, an \textbf{outlier} is an observation with a large prediction error for a particular OLS specification. @@ -5989,15 +6072,13 @@ \subsection{Outliers and leave-one-out \begin{figure}[th] -\centering{ - -\includegraphics{least_squares_files/figure-pdf/fig-outlier-1.pdf} +{\centering \includegraphics{least_squares_files/figure-pdf/fig-outlier-1.pdf} } \caption{\label{fig-outlier}An example of an outlier.} -\end{figure}% +\end{figure} Intuitively, it seems as though we could use the residual \(\widehat{e}_i\) to assess the prediction error for a given unit. But @@ -6017,16 +6098,17 @@ \subsection{Outliers and leave-one-out computationally costly because it seems as though we have to fit OLS \(n\) times. Fortunately, there is a closed-form expression for the LOO coefficients and prediction errors in terms of the original regression, -\begin{equation}\phantomsection\label{eq-loo-coefs}{ +\begin{equation}\protect\hypertarget{eq-loo-coefs}{}{ \bhat_{(-i)} = \bhat - \left( \Xmat'\Xmat\right)^{-1}\X_i\widetilde{e}_i \qquad \widetilde{e}_i = \frac{\widehat{e}_i}{1 - h_{ii}}. -}\end{equation} This shows that the LOO prediction errors will differ -from the residuals when the leverage of a unit is high. This makes -sense! 
We said earlier that observations with low leverage would be -close to \(\overline{\X}\), where the outcome values have relatively -little impact on the OLS fit (because the regression line must go -through \(\overline{Y}\)). +}\label{eq-loo-coefs}\end{equation} This shows that the LOO prediction +errors will differ from the residuals when the leverage of a unit is +high. This makes sense! We said earlier that observations with low +leverage would be close to \(\overline{\X}\), where the outcome values +have relatively little impact on the OLS fit (because the regression +line must go through \(\overline{Y}\)). -\subsection{Influential observations}\label{influential-observations} +\hypertarget{influential-observations}{% +\subsection{Influential observations}\label{influential-observations}} An influential observation (also sometimes called an influential point) is a unit that has the power to change the coefficients and fitted @@ -6035,15 +6117,13 @@ \subsection{Influential observations}\label{influential-observations} \begin{figure}[th] -\centering{ - -\includegraphics{least_squares_files/figure-pdf/fig-influence-1.pdf} +{\centering \includegraphics{least_squares_files/figure-pdf/fig-influence-1.pdf} } \caption{\label{fig-influence}An example of an influence point.} -\end{figure}% +\end{figure} One measure of influence, called DFBETA\(_i\), measures how much \(i\) changes the estimated coefficient vector \[ @@ -6080,7 +6160,8 @@ \subsection{Influential observations}\label{influential-observations} outliers. Finally, consider using methods that are robust to outliers such as least absolute deviations or least trimmed squares. -\section{Summary}\label{summary-5} +\hypertarget{summary-5}{% +\section{Summary}\label{summary-5}} In this chapter, we introduced the \textbf{ordinary least squares} estimator, which finds the linear function of the \(\X_i\) that @@ -6101,7 +6182,8 @@ \section{Summary}\label{summary-5} move from the mechanical properties to the statistical properties of OLS: unbiasedness, consistency, and asymptotic normality. -\chapter{The statistics of least squares}\label{sec-ols-statistics} +\hypertarget{sec-ols-statistics}{% +\chapter{The statistics of least squares}\label{sec-ols-statistics}} The last chapter showcased the least squares estimator and investigated many of its more mechanical properties, which are essential for the @@ -6125,8 +6207,9 @@ \chapter{The statistics of least squares}\label{sec-ols-statistics} assumptions are very strong, so understanding what we can say about OLS without them is vital. +\hypertarget{large-sample-properties-of-ols}{% \section{Large-sample properties of -OLS}\label{large-sample-properties-of-ols} +OLS}\label{large-sample-properties-of-ols}} As we saw in Chapter~\ref{sec-asymptotics}, we need two key ingredients to conduct statistical inference with the OLS estimator: (1) a @@ -6143,7 +6226,7 @@ \section{Large-sample properties of \(\bfbeta = \E[\X_{i}\X_{i}']^{-1}\E[\X_{i}Y_{i}]\), is well-defined and unique. 
-\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Linear projection assumptions}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Linear projection assumptions}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] The linear projection model makes the following assumptions: @@ -6206,15 +6289,15 @@ \section{Large-sample properties of OLS coefficients. We first review some key ideas about the Central Limit Theorem. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{CLT reminder}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{CLT reminder}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] Suppose that we have a function of the data iid random vectors \(\X_1, \ldots, \X_n\), \(g(\X_{i})\) where \(\E[g(\X_{i})] = 0\) and so \(\V[g(\X_{i})] = \E[g(\X_{i})g(\X_{i})']\). Then if \(\E[\Vert g(\X_{i})\Vert^{2}] < \infty\), the CLT implies that -\begin{equation}\phantomsection\label{eq-clt-mean-zero}{ +\begin{equation}\protect\hypertarget{eq-clt-mean-zero}{}{ \sqrt{n}\left(\frac{1}{n} \sum_{i=1}^{n} g(\X_{i}) - \E[g(\X_{i})]\right) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} g(\X_{i}) \indist \N(0, \E[g(\X_{i})g(\X_{i}')]) -}\end{equation} +}\label{eq-clt-mean-zero}\end{equation} \end{tcolorbox} @@ -6259,7 +6342,9 @@ \section{Large-sample properties of coefficients and groups of coefficients. But, first, we need an estimate of the covariance matrix. -\section{Variance estimation for OLS}\label{variance-estimation-for-ols} +\hypertarget{variance-estimation-for-ols}{% +\section{Variance estimation for +OLS}\label{variance-estimation-for-ols}} The asymptotic normality of OLS from the last section is of limited value without some way to estimate the covariance matrix, \[ @@ -6360,8 +6445,9 @@ \section{Variance estimation for OLS}\label{variance-estimation-for-ols} and confidence intervals. Still, this difference is of little consequence in large samples. +\hypertarget{inference-for-multiple-parameters}{% \section{Inference for multiple -parameters}\label{inference-for-multiple-parameters} +parameters}\label{inference-for-multiple-parameters}} With multiple coefficients, we might have hypotheses that involve more than one coefficient. 
As an example, consider a regression with an @@ -6382,9 +6468,9 @@ \section{Inference for multiple \] and we usually take the absolute value, \(|t|\), as our measure of how extreme our estimate is given the null distribution. But notice that we could also use the square of the \(t\) statistic, which is -\begin{equation}\phantomsection\label{eq-squared-t}{ +\begin{equation}\protect\hypertarget{eq-squared-t}{}{ t^{2} = \frac{\left(\widehat{\beta}_{j} - b_{0}\right)^{2}}{\V[\widehat{\beta}_{j}]} = \frac{n\left(\widehat{\beta}_{j} - b_{0}\right)^{2}}{[\mb{V}_{\bfbeta}]_{[jj]}}. -}\end{equation} +}\label{eq-squared-t}\end{equation} While \(|t|\) is the usual test statistic we use for two-sided tests, we could equivalently use \(t^2\) and arrive at the exact same conclusions @@ -6426,19 +6512,19 @@ \section{Inference for multiple \(\widehat{\bs{\theta}} = \mb{L}\bhat\) be the OLS estimate of the function of the coefficients. By the delta method (discussed in Section~\ref{sec-delta-method}), we have \[ -\sqrt{n}\left(\mb{L}\bhat - \mb{L}\bfbeta\right) \indist \N(0, \mb{L}'\mb{V}_{\bfbeta}\mb{L}). +\sqrt{n}\left(\mb{L}\bhat - \mb{L}\bfbeta\right) \indist \N(0, \mb{L}\mb{V}_{\bfbeta}\mb{L}'). \] We can now generalize the squared \(t\) statistic in Equation~\ref{eq-squared-t} by taking the distances \(\mb{L}\bhat - \mb{c}\) weighted by the variance-covariance matrix -\(\mb{L}'\mb{V}_{\bfbeta}\mb{L}\), \[ -W = n(\mb{L}\bhat - \mb{c})'(\mb{L}'\mb{V}_{\bfbeta}\mb{L})^{-1}(\mb{L}\bhat - \mb{c}), +\(\mb{L}\mb{V}_{\bfbeta}\mb{L}'\), \[ +W = n(\mb{L}\bhat - \mb{c})'(\mb{L}\mb{V}_{\bfbeta}\mb{L}')^{-1}(\mb{L}\bhat - \mb{c}), \] which is called the \textbf{Wald test statistic}. This statistic generalizes the ideas of the t-statistic to multiple parameters. With the t-statistic, we recenter to have mean 0 and divide by the standard error to get a variance of 1. If we ignore the middle variance weighting, we have \((\mb{L}\bhat - \mb{c})'(\mb{L}\bhat - \mb{c})\) which is just the sum of the squared deviations of the estimates from -the null. Including the \((\mb{L}'\mb{V}_{\bfbeta}\mb{L})^{-1}\) weight +the null. Including the \((\mb{L}\mb{V}_{\bfbeta}\mb{L}')^{-1}\) weight has the effect of rescaling the distribution of \(\mb{L}\bhat - \mb{c}\) to make it rotationally symmetric around 0 (so the resulting dimensions are uncorrelated) with each dimension having an equal variance of 1. In @@ -6463,9 +6549,7 @@ \section{Inference for multiple \begin{figure}[th] -\centering{ - -\includegraphics{ols_properties_files/figure-pdf/fig-wald-1.pdf} +{\centering \includegraphics{ols_properties_files/figure-pdf/fig-wald-1.pdf} } @@ -6474,7 +6558,7 @@ \section{Inference for multiple the standard Euclidean distance, but the triangle is closer once you consider the joint distribution.} -\end{figure}% +\end{figure} If \(\mb{L}\) only has one row, our Wald statistic is the same as the squared \(t\) statistic, \(W = t^2\). This fact will help us think about @@ -6499,7 +6583,7 @@ \section{Inference for multiple straightforward (see the above callout tip). For \(q = 2\) and a \(\alpha = 0.05\), the critical value is roughly 6. 
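To make the Wald statistic concrete, the following R sketch tests the joint
null that two slope coefficients are zero. The simulated data, the restriction
matrix \texttt{L}, and the use of \texttt{lm()}'s classical variance estimate
are all illustrative choices (a heteroskedasticity-consistent estimate could
be substituted); note that \texttt{vcov()} already includes the \(1/n\)
scaling, so no extra factor of \(n\) appears in the code.

\begin{verbatim}
## Minimal sketch: a Wald test that two slope coefficients are jointly zero
## (simulated, hypothetical data).
set.seed(3)
n  <- 800
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.3 * x1 + 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
L  <- rbind(c(0, 1, 0),            # picks out the coefficient on x1
            c(0, 0, 1))            # picks out the coefficient on x2
c0 <- c(0, 0)                      # hypothesized values under the null

gap <- L %*% coef(fit) - c0
W   <- drop(t(gap) %*% solve(L %*% vcov(fit) %*% t(L)) %*% gap)

W > qchisq(0.95, df = nrow(L))                 # reject at the 5% level?
pchisq(W, df = nrow(L), lower.tail = FALSE)    # asymptotic p-value
\end{verbatim}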
-\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Chi-squared critical values}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Chi-squared critical values}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] We can obtain critical values for the \(\chi^2_q\) distribution using the \texttt{qchisq()} function in R. For example, if we wanted to obtain @@ -6564,8 +6648,9 @@ \section{Inference for multiple intercept. In modern quantitative social sciences, this test is seldom substantively interesting. +\hypertarget{finite-sample-properties-with-a-linear-cef}{% \section{Finite-sample properties with a linear -CEF}\label{finite-sample-properties-with-a-linear-cef} +CEF}\label{finite-sample-properties-with-a-linear-cef}} All the above results have been large-sample properties, and we have not addressed finite-sample properties like the sampling variance or @@ -6575,7 +6660,7 @@ \section{Finite-sample properties with a linear properties for OLS. As usual, however, remember that these stronger assumptions can be wrong. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Assumption: Linear Regression Model}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Assumption: Linear Regression Model}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] \begin{enumerate} \def\labelenumi{\arabic{enumi}.} @@ -6622,6 +6707,7 @@ \section{Finite-sample properties with a linear \end{theorem} \begin{proof} + To prove the conditional unbiasedness, recall that we can write the OLS estimator as \[ \bhat = \bfbeta + (\Xmat'\Xmat)^{-1}\Xmat'\mb{e}, @@ -6642,6 +6728,7 @@ \section{Finite-sample properties with a linear \(\sigma^2_i\) along the diagonal, which means \[ \Xmat'\V[\mb{e} \mid \Xmat]\Xmat = \sum_{i=1}^n \sigma^2_i \X_i\X_i', \] establishing the conditional sampling variance. + \end{proof} This means that, for any realization of the covariates, \(\Xmat\), OLS @@ -6681,8 +6768,9 @@ \section{Finite-sample properties with a linear \] Thus, in practice, the asymptotic and finite-sample results under a linear CEF justify the same variance estimator. 
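To connect these formulas to practice, here is a minimal base-R sketch of the
HC0 ``sandwich'' variance estimator built from \((\Xmat'\Xmat)^{-1}\) and
\(\sum_{i} \widehat{e}_i^2 \X_i\X_i'\); the simulated heteroskedastic data are
purely illustrative. (Packages such as \texttt{sandwich}, via
\texttt{vcovHC()}, implement this estimator along with small-sample
refinements like HC2.)

\begin{verbatim}
## Minimal sketch: the HC0 sandwich variance estimator assembled from the
## pieces in this section, using only base R (simulated, hypothetical data).
set.seed(4)
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n) * (1 + abs(x))   # heteroskedastic errors

fit  <- lm(y ~ x)
X    <- model.matrix(fit)                  # n x (k+1) design matrix
ehat <- resid(fit)

bread <- solve(crossprod(X))               # (X'X)^{-1}
meat  <- crossprod(X * ehat)               # sum_i ehat_i^2 x_i x_i'
vcov_hc0 <- bread %*% meat %*% bread

sqrt(diag(vcov_hc0))                       # robust standard errors
sqrt(diag(vcov(fit)))                      # classical SEs, for comparison
\end{verbatim}

The robust and classical standard errors differ here precisely because the
error variance depends on \(x\).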
+\hypertarget{linear-cef-model-under-homoskedasticity}{% \subsection{Linear CEF model under -homoskedasticity}\label{linear-cef-model-under-homoskedasticity} +homoskedasticity}\label{linear-cef-model-under-homoskedasticity}} If we are willing to assume that the standard errors are homoskedastic, we can derive even stronger results for OLS. Stronger assumptions @@ -6692,7 +6780,7 @@ \subsection{Linear CEF model under that statistical software implementations of OLS like \texttt{lm()} in R assume it by default. -\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Assumption: Homoskedasticity with a linear CEF}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Assumption: Homoskedasticity with a linear CEF}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] In addition to the linear CEF assumption, we further assume that \[ \E[e_i^2 \mid \X_i] = \E[e_i^2] = \sigma^2, @@ -6713,6 +6801,7 @@ \subsection{Linear CEF model under \end{theorem} \begin{proof} + Under homoskedasticity \(\sigma^2_i = \sigma^2\) for all \(i\). Recall that \(\sum_{i=1}^n \X_i\X_i' = \Xmat'\Xmat\). Thus, the conditional sampling variance from Theorem~\ref{thm-ols-unbiased}, \[ @@ -6748,6 +6837,7 @@ \subsection{Linear CEF model under \end{aligned} \] This establishes \(\E[\widehat{\mb{V}}^{\texttt{lm}}_{\bhat} \mid \Xmat] = \mb{V}^{\texttt{lm}}_{\bhat}\). + \end{proof} Thus, under the linear CEF model and homoskedasticity of the errors, we @@ -6776,6 +6866,7 @@ \subsection{Linear CEF model under \end{theorem} \begin{proof} + Note that if \(\widetilde{\bfbeta}\) is unbiased then \(\E[\widetilde{\bfbeta} \mid \Xmat] = \bfbeta\) and so \[ \bfbeta = \E[\mb{AY} \mid \Xmat] = \mb{A}\E[\mb{Y} \mid \Xmat] = \mb{A}\Xmat\bfbeta, @@ -6806,6 +6897,7 @@ \subsection{Linear CEF model under well. The fifth inequality holds because matrix products of the form \(\mb{BB}'\) are positive definite if \(\mb{B}\) is of full rank (which we have assumed it is). + \end{proof} In this proof, we saw that the variance of the competing estimator had @@ -6830,7 +6922,8 @@ \subsection{Linear CEF model under estimators, Hansen (2022) proves a more general version of this result that applies to any unbiased estimator. -\section{The normal linear model}\label{the-normal-linear-model} +\hypertarget{the-normal-linear-model}{% +\section{The normal linear model}\label{the-normal-linear-model}} Finally, we add the strongest and thus least loved of the classical linear regression assumption: (conditional) normality of the errors. @@ -6847,7 +6940,7 @@ \section{The normal linear model}\label{the-normal-linear-model} (conditional) normality of the errors, basically proceeding with some knowledge that we were wrong but hopefully not too wrong. 
-\begin{tcolorbox}[enhanced jigsaw, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{The normal linear regression model}, breakable, colbacktitle=quarto-callout-note-color!10!white, toptitle=1mm, colback=white, arc=.35mm, left=2mm, opacityback=0, titlerule=0mm, colframe=quarto-callout-note-color-frame, leftrule=.75mm, coltitle=black, opacitybacktitle=0.6, bottomtitle=1mm, rightrule=.15mm, bottomrule=.15mm, toprule=.15mm] +\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{The normal linear regression model}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame] In addition to the linear CEF assumption, we assume that \[ e_i \mid \Xmat \sim \N(0, \sigma^2). @@ -6919,7 +7012,8 @@ \section{The normal linear model}\label{the-normal-linear-model} unclear. But it may be the best we can do while we go and find more data. -\section{Summary}\label{summary-6} +\hypertarget{summary-6}{% +\section{Summary}\label{summary-6}} In this chapter, we discussed the large-sample properties of OLS, which are quite strong. Under mild conditions, OLS is consistent for the @@ -6937,24 +7031,25 @@ \section{Summary}\label{summary-6} \bookmarksetup{startatroot} -\chapter*{References}\label{references} +\hypertarget{references}{% +\chapter*{References}\label{references}} \addcontentsline{toc}{chapter}{References} \markboth{References}{References} -\phantomsection\label{refs} +\hypertarget{refs}{} \begin{CSLReferences}{1}{0} -\bibitem[\citeproctext]{ref-Hansen22} +\leavevmode\vadjust pre{\hypertarget{ref-Hansen22}{}}% Hansen, Bruce E. 2022. {``A {Modern Gauss}--{Markov Theorem}.''} \emph{Econometrica} 90 (3): 1283--94. \url{https://doi.org/10.3982/ECTA19255}. -\bibitem[\citeproctext]{ref-Senn12} +\leavevmode\vadjust pre{\hypertarget{ref-Senn12}{}}% Senn, Stephen. 2012. {``Tea for Three: Of Infusions and Inferences and Milk in First.''} \emph{Significance} 9 (6): 30--33. https://doi.org/\url{https://doi.org/10.1111/j.1740-9713.2012.00620.x}. -\bibitem[\citeproctext]{ref-Squire88} +\leavevmode\vadjust pre{\hypertarget{ref-Squire88}{}}% SQUIRE, PEVERILL. 1988. {``WHY THE 1936 LITERARY DIGEST POLL FAILED.''} \emph{Public Opinion Quarterly} 52 (1): 125--33. \url{https://doi.org/10.1086/269085}.