From edb5d2569b2dd4e1dcb84c25c7a355a2892e029b Mon Sep 17 00:00:00 2001 From: Matt Blackwell Date: Mon, 13 Nov 2023 15:28:07 -0500 Subject: [PATCH 1/3] typo (thanks Liz Darnell) --- 06_linear_model.qmd | 2 +- .../06_linear_model/execute-results/html.json | 4 ++-- .../06_linear_model/execute-results/tex.json | 4 ++-- .../figure-pdf/fig-blp-limits-1.pdf | Bin 5873 -> 5876 bytes .../figure-pdf/fig-cef-binned-1.pdf | Bin 14125 -> 14128 bytes .../figure-pdf/fig-cef-blp-1.pdf | Bin 13967 -> 13970 bytes 6 files changed, 5 insertions(+), 5 deletions(-) diff --git a/06_linear_model.qmd b/06_linear_model.qmd index cd998fd..a81da3d 100644 --- a/06_linear_model.qmd +++ b/06_linear_model.qmd @@ -418,7 +418,7 @@ Here, the coefficients are slightly more interpretable: * $\beta_1$: the marginal effect of $X_{i1}$ on predicted $Y_i$ when $X_{i2} = 0$. * $\beta_2$: the marginal effect of $X_{i2}$ on predicted $Y_i$ when $X_{i1} = 0$. -* $\beta_3$: the change in the marginal effect of $X_{i1}$ due to a one-unit change in $X_{i2}$ **OR** the change in the marginal effect of $X_{i2}$ due to a one-unit change in $X_{i2}$. +* $\beta_3$: the change in the marginal effect of $X_{i1}$ due to a one-unit change in $X_{i2}$ **OR** the change in the marginal effect of $X_{i2}$ due to a one-unit change in $X_{i1}$. If we add more covariates to this BLP, these interpretations change to "holding all other covariates constant." diff --git a/_freeze/06_linear_model/execute-results/html.json b/_freeze/06_linear_model/execute-results/html.json index 8ee94d1..9a8d57a 100644 --- a/_freeze/06_linear_model/execute-results/html.json +++ b/_freeze/06_linear_model/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "462f1dcdf8bf485c66f7cba2bc47ca4d", + "hash": "1771ecc1d57037175f1ca8d02824cb54", "result": { - "markdown": "# Linear regression {#sec-regression}\n\n\nRegression is a tool for assessing the relationship between an **outcome variable**, $Y_i$, and a set of **covariates**, $\\X_i$. In particular, these tools show how the conditional mean of $Y_i$ varies as a function of $\\X_i$. For example, we may want to know how voting poll wait times vary as a function of some socioeconomic features of the precinct, like income and racial composition. We usually accomplish this task by estimating the **regression function** or **conditional expectation function** (CEF) of the outcome given the covariates, \n$$\n\\mu(\\bfx) = \\E[Y_i \\mid \\X_i = \\bfx].\n$$\nWhy are estimation and inference for this regression function special? Why can't we just use the approaches we have seen for the mean, variance, covariance, and so on? The fundamental problem with the CEF is that there may be many, many values $\\bfx$ that can occur and many different conditional expectations that we will need to estimate. If any variable in $\\X_i$ is continuous, we must estimate an infinite number of possible values of $\\mu(\\bfx)$. Because it worsens as we add covariates to $\\X_i$, we refer to this problem as the **curse of dimensionality**. How can we resolve this with our measly finite data?\n\nIn this chapter, we will explore two ways of \"solving\" the curse of dimensionality: assuming it away and changing the quantity of interest to something easier to estimate. \n\n\nRegression is so ubiquitous in many scientific fields that it has a lot of acquired notational baggage. 
In particular, the labels of the $Y_i$ and $\\X_i$ vary greatly:\n\n- The outcome can also be called: the response variable, the dependent variable, the labels (in machine learning), the left-hand side variable, or the regressand. \n- The covariates are also called: the explanatory variables, the independent variables, the predictors, the regressors, inputs, or features. \n\n\n## Why do we need models?\n\nAt first glance, the connection between the CEF and parametric models might be hazy. For example, imagine we are interested in estimating the average poll wait times ($Y_i$) for Black voters ($X_i = 1$) versus non-Black voters ($X_i=0$). In that case, there are two parameters to estimate, \n$$\n\\mu(1) = \\E[Y_i \\mid X_i = 1] \\quad \\text{and}\\quad \\mu(0) = \\E[Y_i \\mid X_i = 0],\n$$\nwhich we could estimate by using the plug-in estimators that replace the population averages with their sample counterparts,\n$$ \n\\widehat{\\mu}(1) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 1)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 1)} \\qquad \\widehat{\\mu}(0) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 0)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 0)}.\n$$\nThese are just the sample averages of the wait times for Black and non-Black voters, respectively. And because the race variable here is discrete, we are simply estimating sample means within subpopulations defined by race. The same logic would apply if we had $k$ racial categories: we would have $k$ conditional expectations to estimate and $k$ (conditional) sample means. \n\nNow imagine that we want to know how the average poll wait time varies as a function of income so that $X_i$ is (essentially) continuous. Now we have a different conditional expectation for every possible dollar amount from 0 to Bill Gates's income. Imagine we pick a particular income, \\$42,238, and so we are interested in the conditional expectation $\\mu(42,238)= \\E[Y_{i}\\mid X_{i} = 42,238]$. We could use the same plug-in estimator in the discrete case, \n$$\n\\widehat{\\mu}(42,238) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 42,238)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 42,238)}.\n$$\nWhat is the problem with this estimator? In all likelihood, no units in any particular dataset have that exact income, meaning this estimator is undefined (we would be dividing by zero). \n\n\nOne solution to this problem is to use **subclassification**, turn the continuous variable into a discrete one, and proceed with the discrete approach above. We might group incomes into \\$25,000 bins and then calculate the average wait times of anyone between, say, \\$25,000 and \\$50,000 income. When we make this estimator switch for practical purposes, we need to connect it back to the DGP of interest. We could **assume** that the CEF of interest only depends on these binned means, which would mean we have: \n$$\n\\mu(x) = \n\\begin{cases}\n \\E[Y_{i} \\mid 0 \\leq X_{i} < 25,000] &\\text{if } 0 \\leq x < 25,000 \\\\\n \\E[Y_{i} \\mid 25,000 \\leq X_{i} < 50,000] &\\text{if } 25,000 \\leq x < 50,000\\\\\n \\E[Y_{i} \\mid 50,000 \\leq X_{i} < 100,000] &\\text{if } 50,000 \\leq x < 100,000\\\\\n \\vdots \\\\\n \\E[Y_{i} \\mid 200,000 \\leq X_{i}] &\\text{if } 200,000 \\leq x\\\\\n\\end{cases}\n$$\nThis approach assumes, perhaps incorrectly, that the average wait time does not vary within the bins. @fig-cef-binned shows a hypothetical joint distribution between income and wait times with the true CEF, $\\mu(x)$, shown in red. 
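Before describing the rest of the figure, here is a minimal Python/numpy sketch (not from the chapter) of the subclassification idea just described: compute the sample mean wait time within each income bin, using simulated data whose variable names and bin edges are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative only): income in dollars, poll wait times in minutes.
n = 5_000
income = rng.uniform(0, 250_000, size=n)
wait = 40 - 25 * np.sqrt(income / 250_000) + rng.normal(0, 5, size=n)  # nonlinear "true" CEF plus noise

# Plug-in estimator within discrete groups: the sample mean of Y among units whose X falls in the bin.
bins = np.array([0, 25_000, 50_000, 100_000, 200_000, np.inf])
bin_ids = np.digitize(income, bins) - 1           # which bin each unit falls into
binned_means = np.array([wait[bin_ids == b].mean() for b in range(len(bins) - 1)])

print(binned_means)   # the estimated bin-constant CEF: one conditional mean per income bin
```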
The figure also shows the bins created by subclassification and the implied CEF if we assume bin-constant means in blue. We can see that the blue function approximates the true CEF but deviates from it close to the bin edges. The trade-off is that once we make the assumption, we only have to estimate one mean for every bin rather than an infinite number of means for each possible income. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of income and poll wait times (contour plot), conditional expectation function (red), and the conditional expectation of the binned income (blue).](06_linear_model_files/figure-html/fig-cef-binned-1.png){#fig-cef-binned width=672}\n:::\n:::\n\n\nSimilarly, we could **assume** that the CEF follows a simple functional form like a line,\n$$ \n\\mu(x) = \\E[Y_{i}\\mid X_{i} = x] = \\beta_{0} + \\beta_{1} x.\n$$\nThis assumption reduces our infinite number of unknowns (the conditional mean at every possible income) to just two unknowns: the slope and intercept. As we will see, we can use the standard ordinary least squares to estimate these parameters. Notice again that if the true CEF is nonlinear, this assumption is incorrect, and any estimate based on this assumption might be biased or even inconsistent. \n\nWe call the binning and linear assumptions on $\\mu(x)$ **functional form** assumptions because they restrict the class of functions that $\\mu(x)$ can take. While powerful, these types of assumptions can muddy the roles of defining the quantity of interest and estimation. If our estimator $\\widehat{\\mu}(x)$ performs poorly, it will be difficult to tell if this is because the estimator is flawed or our functional form assumptions are incorrect. \n\nTo help clarify these issues, we will pursue a different approach: understanding what linear regression can estimate under minimal assumptions and then investigating how well this estimand approximates the true CEF. \n\n## Population linear regression {#sec-linear-projection}\n\n### Bivariate linear regression \n\n\nLet's set aside the idea of the conditional expectation function and instead focus on finding the **linear** function of a single covariate $X_i$ that best predicts the outcome. Remember that linear functions have the form $a + bX_i$. The **best linear predictor** (BLP) or **population linear regression** of $Y_i$ on $X_i$ is defined as\n$$ \nm(x) = \\beta_0 + \\beta_1 x \\quad\\text{where, }\\quad (\\beta_{0}, \\beta_{1}) = \\argmin_{(b_{0}, b_{1}) \\in \\mathbb{R}^{2}}\\; \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}].\n$$\nThat is, the best linear predictor is the line that results in the lowest mean-squared error predictions of the outcome given the covariates, averaging over the joint distribution of the data. This function is a feature of the joint distribution of the data---the DGP---and so represents something that we would like to learn about with our sample. It is an alternative to the CEF for summarizing the relationship between the outcome and the covariate, though we will see that they will sometimes be equal. We call $(\\beta_{0}, \\beta_{1})$ the **population linear regression coefficients**. Notice that $m(x)$ could differ greatly from the CEF $\\mu(x)$ if the latter is nonlinear. \n\nWe can solve for the best linear predictor using standard calculus (taking the derivative with respect to each coefficient, setting those equations equal to 0, and solving the system of equations). 
The first-order conditions, in this case, are\n$$ \n\\begin{aligned}\n \\frac{\\partial \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}]}{\\partial b_{0}} = \\E[-2(Y_{i} - \\beta_{0} - \\beta_{1}X_{i})] = 0 \\\\\n \\frac{\\partial \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}]}{\\partial b_{1}} = \\E[-2(Y_{i} - \\beta_{0} - \\beta_{1}X_{i})X_{i}] = 0\n\\end{aligned} \n$$\nGiven the linearity of expectations, it is easy to solve for $\\beta_0$ in terms of $\\beta_1$,\n$$ \n\\beta_{0} = \\E[Y_{i}] - \\beta_{1}\\E[X_{i}].\n$$\nWe can plug this into the first-order condition for $\\beta_1$ to get\n$$ \n\\begin{aligned}\n 0 &= \\E[Y_{i}X_{i}] - (\\E[Y_{i}] - \\beta_{1}\\E[X_{i}])\\E[X_{i}] - \\beta_{1}\\E[X_{i}^{2}] \\\\\n &= \\E[Y_{i}X_{i}] - \\E[Y_{i}]\\E[X_{i}] - \\beta_{1}(\\E[X_{i}^{2}] - \\E[X_{i}]^{2}) \\\\\n &= \\cov(X_{i},Y_{i}) - \\beta_{1}\\V[X_{i}]\\\\\n \\beta_{1} &= \\frac{\\cov(X_{i},Y_{i})}{\\V[X_{i}]}\n\\end{aligned}\n$$\n\nThus the slope on the population linear regression of $Y_i$ on $X_i$ is equal to the ratio of the covariance of the two variables divided by the variance of $X_i$. From this, we can immediately see that the covariance will determine the sign of the slope: positive covariances will lead to positive $\\beta_1$ and negative covariances will lead to negative $\\beta_1$. In addition, we can see that if $Y_i$ and $X_i$ are independent, $\\beta_1 = 0$. The slope scales this covariance by the variance of the covariate, so slopes are lower for more spread-out covariates and higher for more spread-out covariates. If we define the correlation between these variables as $\\rho_{YX}$, then we can relate the coefficient to this quantity as \n$$\n\\beta_1 = \\rho_{YX}\\sqrt{\\frac{\\V[Y_i]}{\\V[X_i]}}.\n$$\n\nCollecting together our results, we can write the population linear regression as \n$$\nm(x) = \\beta_0 + \\beta_1x = \\E[Y_i] + \\beta_1(x - \\E[X_i]),\n$$\nwhich shows how we adjust our best guess about $Y_i$ from the mean of the outcome using the covariate. \n\nIt's important to remember that the BLP, $m(x)$, and the CEF, $\\mu(x)$, are distinct entities. If the CEF is nonlinear, as in @fig-cef-blp, there will be a difference between these functions, meaning that the BLP might produce subpar predictions. Below, we will derive a formal connection between the BLP and the CEF. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Comparison of the CEF and the best linear predictor.](06_linear_model_files/figure-html/fig-cef-blp-1.png){#fig-cef-blp width=672}\n:::\n:::\n\n\n\n### Beyond linear approximations\n\nThe linear part of the best linear predictor is less restrictive than at first glance. We can easily modify the minimum MSE problem to find the best quadratic, cubic, or general polynomial function of $X_i$ that predicts $Y_i$. For example, the quadratic function of $X_i$ that best predicts $Y_i$ would be\n$$ \nm(X_i, X_i^2) = \\beta_0 + \\beta_1X_i + \\beta_2X_i^2 \\quad\\text{where}\\quad \\argmin_{(b_0,b_1,b_2) \\in \\mathbb{R}^3}\\;\\E[(Y_{i} - b_{0} - b_{1}X_{i} - b_{2}X_{i}^{2})^{2}].\n$$\nThis equation is now a quadratic function of the covariates, but it is still a linear function of the unknown parameters $(\\beta_{0}, \\beta_{1}, \\beta_{2})$ so we will call this a best linear predictor. \n\nWe could include higher order terms of $X_i$ in the same manner, and as we include more polynomial terms, $X_i^p$, the more flexible the function of $X_i$ we will capture with the BLP. 
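A small numerical sketch (ours, not part of the chapter, using simulated data) can make these population formulas concrete: the sample analogue of $\beta_1 = \cov(X,Y)/\V[X]$ matches the least-squares slope, and adding a squared term gives the best quadratic predictor, which is still linear in the coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with a nonlinear CEF: E[Y | X = x] = 2 + x - 0.5 * x**2
n = 100_000
x = rng.normal(0, 1, size=n)
y = 2 + x - 0.5 * x**2 + rng.normal(0, 1, size=n)

# Bivariate BLP slope: beta_1 = cov(X, Y) / V[X]; intercept: beta_0 = E[Y] - beta_1 * E[X].
beta1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)

# Same answer from least squares on (1, x); adding x**2 fits the best quadratic predictor,
# which is still linear in the unknown coefficients.
X_lin = np.column_stack([np.ones(n), x])
X_quad = np.column_stack([np.ones(n), x, x**2])
print(np.linalg.lstsq(X_lin, y, rcond=None)[0])   # close to (beta0, beta1)
print(np.linalg.lstsq(X_quad, y, rcond=None)[0])  # close to (2, 1, -0.5)
```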
When we estimate the BLP, however, we usually will pay for this flexibility in terms of overfitting and high variance in our estimates. \n\n\n### Linear prediction with multiple covariates \n\nWe now generalize the idea of a best linear predictor to a setting with an arbitrary number of covariates. In this setting, remember that the linear function will be\n\n$$ \n\\bfx'\\bfbeta = x_{1}\\beta_{1} + x_{2}\\beta_{2} + \\cdots + x_{k}\\beta_{k}.\n$$\nWe will define the **best linear predictor** (BLP) to be\n$$ \nm(\\bfx) = \\bfx'\\bfbeta, \\quad \\text{where}\\quad \\bfbeta = \\argmin_{\\mb{b} \\in \\real^k}\\; \\E\\bigl[ \\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\\bigr]\n$$\n\nThis BLP solves the same fundamental optimization problem as in the bivariate case: it chooses the set of coefficients that minimizes the mean-squared error averaging over the joint distribution of the data. \n\n\n\n::: {.callout-note}\n## Best linear projection assumptions\n\nWithout some assumptions on the joint distribution of the data, the following \"regularity conditions\" will ensure the existence of the BLP:\n\n1. $\\E[Y^2] < \\infty$ (outcome has finite mean/variance)\n2. $\\E\\Vert \\mb{X} \\Vert^2 < \\infty$ ($\\mb{X}$ has finite means/variances/covariances)\n3. $\\mb{Q}_{\\mb{XX}} = \\E[\\mb{XX}']$ is positive definite (columns of $\\X$ are linearly independent) \n:::\n\nUnder these assumptions, it is possible to derive a closed-form expression for the **population coefficients** $\\bfbeta$ using matrix calculus. To set up the optimization problem, we will find the first-order condition by taking the derivative of the expectation of the squared errors. First, let's take the derivative of the squared prediction errors using the chain rule:\n$$ \n\\begin{aligned}\n \\frac{\\partial}{\\partial \\mb{b}}\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)^{2}\n &= 2\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)\\frac{\\partial}{\\partial \\mb{b}}(Y_{i} - \\X_{i}'\\mb{b}) \\\\\n &= -2\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)\\X_{i} \\\\\n &= -2\\X_{i}\\left(Y_{i} - \\X_{i}'\\mb{b}\\right) \\\\\n &= -2\\left(\\X_{i}Y_{i} - \\X_{i}\\X_{i}'\\mb{b}\\right),\n\\end{aligned}\n$$\nwhere the third equality comes from the fact that $(Y_{i} - \\X_{i}'\\bfbeta)$ is a scalar. We can now plug this into the expectation to get the first-order condition and solve for $\\bfbeta$,\n$$ \n\\begin{aligned}\n 0 &= -2\\E[\\X_{i}Y_{i} - \\X_{i}\\X_{i}'\\bfbeta ] \\\\\n \\E[\\X_{i}\\X_{i}'] \\bfbeta &= \\E[\\X_{i}Y_{i}],\n\\end{aligned}\n$$\nwhich implies the population coefficients are\n$$ \n\\bfbeta = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] = \\mb{Q}_{\\mb{XX}}^{-1}\\mb{Q}_{\\mb{X}Y}\n$$\nWe now have an expression for the coefficients for the population best linear predictor in terms of the joint distribution $(Y_{i}, \\X_{i})$. A couple of facts might be useful for reasoning this expression. Recall that $\\mb{Q}_{\\mb{XX}} = \\E[\\X_{i}\\X_{i}']$ is a $k\\times k$ matrix and $\\mb{Q}_{\\X Y} = \\E[\\X_{i}Y_{i}]$ is a $k\\times 1$ column vector, which implies that $\\bfbeta$ is also a $k \\times 1$ column vector. \n\n::: {.callout-note}\n\nIntuitively, what is happening in the expression for the population regression coefficients? It is helpful to separate the intercept or constant term so that we have\n$$ \nY_{i} = \\beta_{0} + \\X'\\bfbeta + e_{i},\n$$\nso $\\bfbeta$ refers to just the vector of coefficients for the covariates. 
In this case, we can write the coefficients in a more interpretable way:\n$$ \n\\bfbeta = \\V[\\X]^{-1}\\text{Cov}(\\X, Y), \\qquad \\beta_0 = \\mu_Y - \\mb{\\mu}'_{\\mb{X}}\\bfbeta\n$$\n\nThus, the population coefficients take the covariance between the outcome and the covariates and \"divide\" it by information about variances and covariances of the covariates. The intercept recenters the regression so that projection errors are mean zero. Thus, we can see that these coefficients generalize the bivariate formula to this multiple covariate context. \n:::\n\nWith an expression for the population linear regression coefficients, we can write the linear projection as \n$$ \nm(\\X_{i}) = \\X_{i}'\\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] = \\X_{i}'\\mb{Q}_{\\mb{XX}}^{-1}\\mb{Q}_{\\mb{X}Y}\n$$\n\n\n\n### Projection error\n\nThe **projection error** is the difference between the actual value of $Y_i$ and the projection,\n$$ \ne_{i} = Y_{i} - m(\\X_{i}) = Y_i - \\X_{i}'\\bfbeta,\n$$\nwhere we have made no assumptions about this error yet. The projection error is simply the prediction error of the best linear prediction. Rewriting this definition, we can see that we can always write the outcome as the linear projection plus the projection error,\n$$ \nY_{i} = \\X_{i}'\\bfbeta + e_{i}.\n$$\nNotice that this looks suspiciously similar to a linearity assumption on the CEF, but we haven't made any assumptions here. Instead, we have just used the definition of the projection error to write a tautological statement: \n$$ \nY_{i} = \\X_{i}'\\bfbeta + e_{i} = \\X_{i}'\\bfbeta + Y_{i} - \\X_{i}'\\bfbeta = Y_{i}.\n$$\nThe critical difference between this representation and the usual linear model assumption is what properties $e_{i}$ possesses. \n\nOne key property of the projection errors is that when the covariate vector includes an \"intercept\" or constant term, the projection errors are uncorrelated with the covariates. To see this, we first note that $\\E[\\X_{i}e_{i}] = 0$ since\n$$ \n\\begin{aligned}\n \\E[\\X_{i}e_{i}] &= \\E[\\X_{{i}}(Y_{i} - \\X_{i}'\\bfbeta)] \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}\\X_{i}']\\bfbeta \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}\\X_{i}']\\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}Y_{i}] = 0\n\\end{aligned}\n$$\nThus, for every $X_{ij}$ in $\\X_{i}$, we have $\\E[X_{ij}e_{i}] = 0$. If one of the entries in $\\X_i$ is a constant 1, then this also implies that $\\E[e_{i}] = 0$. Together, these facts imply that the projection error is uncorrelated with each $X_{ij}$, since\n$$ \n\\cov(X_{ij}, e_{i}) = \\E[X_{ij}e_{i}] - \\E[X_{ij}]\\E[e_{i}] = 0 - 0 = 0\n$$\nNotice that we still have made no assumptions about these projection errors except for some mild regularity conditions on the joint distribution of the outcome and covariates. Thus, in very general settings, we can write the linear projection model $Y_i = \\X_i'\\bfbeta + e_i$ where $\\bfbeta = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}]$ and conclude that $\\E[\\X_{i}e_{i}] = 0$ by definition, not by assumption. \n\nThe projection error is uncorrelated with the covariates, so does this mean that the CEF is linear? Unfortunately, no. Recall that while independence implies uncorrelated, the reverse does not hold. 
So when we look at the CEF, we have\n$$ \n\\E[Y_{i} \\mid \\X_{i}] = \\X_{i}'\\bfbeta + \\E[e_{i} \\mid \\X_{i}],\n$$\nand the last term $\\E[e_{i} \\mid \\X_{i}]$ would only be 0 if the errors were independent of the covariates, so $\\E[e_{i} \\mid \\X_{i}] = \\E[e_{i}] = 0$. But nowhere in the linear projection model did we assume this. So while we can (almost) always write the outcome as $Y_i = \\X_i'\\bfbeta + e_i$ and have those projection errors be uncorrelated with the covariates, it will require additional assumptions to ensure that the true CEF is, in fact, linear $\\E[Y_{i} \\mid \\X_{i}] = \\X_{i}'\\bfbeta$. \n\nLet's take a step back. What have we shown here? In a nutshell, we have shown that a population linear regression exists under very general conditions, and we can write the coefficients of that population linear regression as a function of expectations of the joint distribution of the data. We did not assume that the CEF was linear nor that the projection errors were normal. \n\n\nWhy do we care about this? The ordinary least squares estimator, the workhorse regression estimator, targets this quantity of interest in large samples, regardless of whether the true CEF is linear or not. Thus, even when a linear CEF assumption is incorrect, OLS still targets a perfectly valid quantity of interest: the coefficients from this population linear projection. \n\n## Linear CEFs without assumptions\n\nWhat is the relationship between the best linear predictor (which we just saw generally exists) and the CEF? To draw the connection, remember a vital property of the conditional expectation: it is the function of $\\X_i$ that best predicts $Y_{i}$. The population regression was the best **linear** predictor, but the CEF is the best predictor among all nicely behaved functions of $\\X_{i}$, linear or nonlinear. In particular, if we label $L_2$ to be the set of all functions of the covariates $g()$ that have finite squared expectation, $\\E[g(\\X_{i})^{2}] < \\infty$, then we can show that the CEF has the lowest squared prediction error in this class of functions:\n$$ \n\\mu(\\X) = \\E[Y_{i} \\mid \\X_{i}] = \\argmin_{g(\\X_i) \\in L_2}\\; \\E\\left[(Y_{i} - g(\\X_{i}))^{2}\\right],\n$$\n\nSo we have established that the CEF is the best predictor and the population linear regression $m(\\X_{i})$ is the best linear predictor. These two facts allow us to connect the CEF and the population regression.\n\n::: {#thm-cef-blp}\nIf $\\mu(\\X_{i})$ is a linear function of $\\X_i$, then $\\mu(\\X_{i}) = m(\\X_{i}) = \\X_i'\\bfbeta$. \n\n:::\n\nThis theorem says that if the true CEF is linear, it equals the population linear regression. The proof of this is straightforward: the CEF is the best predictor, so if it is linear, it must also be the best linear predictor. \n \n \nIn general, we are in the business of learning about the CEF, so we are unlikely to know if it genuinely is linear or not. In some situations, however, we can show that the CEF is linear without any additional assumptions. These will be situations when the covariates take on a finite number of possible values. Suppose we are interested in the CEF of poll wait times for Black ($X_i = 1$) vs. non-Black ($X_i = 0$) voters. In this case, there are two possible values of the CEF, $\\mu(1) = \\E[Y_{i}\\mid X_{i}= 1]$, the average wait time for Black voters, and $\\mu(0) = \\E[Y_{i}\\mid X_{i} = 0]$, the average wait time for non-Black voters. 
Notice that we can write the CEF as\n$$ \n\\mu(x) = x \\mu(1) + (1 - x) \\mu(0) = \\mu(0) + x\\left(\\mu(1) - \\mu(0)\\right)= \\beta_0 + x\\beta_1,\n$$\nwhich is clearly a linear function of $x$. Based on this derivation, we can see that the coefficients of this linear CEF have a clear interpretation:\n\n- $\\beta_0 = \\mu(0)$: the expected wait time for a Black voter. \n- $\\beta_1 = \\mu(1) - \\mu(0)$: the difference in average wait times between Black and non-Black voters. \nNotice that it matters how $X_{i}$ is defined here since the intercept will always be the average outcome when $X_i = 0$, and the slope will always be the difference in means between the $X_i = 1$ group and the $X_i = 0$ group. \n\nWhat about a categorical covariate with more than two levels? For instance, we might be interested in wait times by party identification, where $X_i = 1$ indicates Democratic voters, $X_i = 2$ indicates Republican voters, and $X_i = 3$ indicates independent voters. How can we write the CEF of wait times as a linear function of this variable? That would assume that the difference between Democrats and Republicans is the same as for Independents and Republicans. With more than two levels, we can represent a categorical variable as a vector of binary variables, $\\X_i = (X_{i1}, X_{i2})$, where\n$$ \n\\begin{aligned}\n X_{{i1}} &= \\begin{cases}\n 1&\\text{if Republican} \\\\\n 0 & \\text{if not Republican}\n \\end{cases} \\\\\nX_{{i2}} &= \\begin{cases}\n 1&\\text{if independent} \\\\\n 0 & \\text{if not independent}\n \\end{cases} \\\\\n\\end{aligned}\n$$\nThese two indicator variables encode the same information as the original three-level variable, $X_{i}$. If I know the values of $X_{i1}$ and $X_{i2}$, I know exactly what party to which $i$ belongs. Thus, the CEFs for $X_i$ and the pair of indicator variables, $\\X_i$, are precisely the same, but the latter admits a lovely linear representation,\n$$\n\\E[Y_i \\mid X_{i1}, X_{i2}] = \\beta_0 + \\beta_1 X_{i1} + \\beta_2 X_{i2},\n$$\nwhere\n\n- $\\beta_0 = \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the average wait time for the group who does not get an indicator variable (Democrats in this case). \n- $\\beta_1 = \\E[Y_{i} \\mid X_{i1} = 1, X_{i2} = 0] - \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the difference in means between Republican voters and Democratic voters, or the difference between the first indicator group and the baseline group. \n- $\\beta_2 = \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 1] - \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the difference in means between independent voters and Democratic voters, or the difference between the second indicator group and the baseline group.\n\nThis approach easily generalizes to categorical variables with an arbitrary number of levels. \n\nWhat have we shown? The CEF will be linear without additional assumptions when there is a categorical covariate. We can show that this continues to hold even when we have multiple categorical variables. We now have two binary covariates: $X_{i1}=1$ indicating a Black voter, and $X_{i2} = 1$ indicating an urban voter. 
With these two binary variables, there are four possible values of the CEF:\n$$ \n\\mu(x_1, x_2) = \\begin{cases} \n \\mu_{00} & \\text{if } x_1 = 0 \\text{ and } x_2 = 0 \\text{ (non-Black, rural)} \\\\\n \\mu_{10} & \\text{if } x_1 = 1 \\text{ and } x_2 = 0 \\text{ (Black, rural)} \\\\\n \\mu_{01} & \\text{if } x_1 = 0 \\text{ and } x_2 = 1 \\text{ (non-Black, urban)} \\\\\n \\mu_{11} & \\text{if } x_1 = 1 \\text{ and } x_2 = 1 \\text{ (Black, urban)}\n \\end{cases}\n$$\nWe can write this as\n$$ \n\\mu(x_{1}, x_{2}) = (1 - x_{1})(1 - x_{2})\\mu_{00} + x_{1}(1 -x_{2})\\mu_{10} + (1-x_{1})x_{2}\\mu_{01} + x_{1}x_{2}\\mu_{11},\n$$\nwhich we can rewrite as \n$$ \n\\mu(x_1, x_2) = \\beta_0 + x_1\\beta_1 + x_2\\beta_2 + x_1x_2\\beta_3,\n$$\nwhere\n\n- $\\beta_0 = \\mu_{00}$: average wait times for rural non-Black voters. \n- $\\beta_1 = \\mu_{10} - \\mu_{00}$: difference in means for rural Black vs. rural non-Black voters. \n- $\\beta_2 = \\mu_{01} - \\mu_{00}$: difference in means for urban non-Black vs. rural non-Black voters. \n- $\\beta_3 = (\\mu_{11} - \\mu_{01}) - (\\mu_{10} - \\mu_{00})$: difference in urban racial difference vs rural racial difference.\n\nThus, we can write the CEF with two binary covariates as linear when the linear specification includes a multiplicative interaction between them ($x_1x_2$). This result holds for all pairs of binary covariates, and we can generalize the interpretation of the coefficients in the CEF as\n\n- $\\beta_0 = \\mu_{00}$: average outcome when both variables are 0. \n- $\\beta_1 = \\mu_{10} - \\mu_{00}$: difference in average outcomes for the first covariate when the second covariate is 0. \n- $\\beta_2 = \\mu_{01} - \\mu_{00}$: difference in average outcomes for the second covariate when the first covariate is 0. \n- $\\beta_3 = (\\mu_{11} - \\mu_{01}) - (\\mu_{10} - \\mu_{00})$: change in the \"effect\" of the first (second) covariate when the second (first) covariate goes from 0 to 1. \n\nThis result also generalizes to an arbitrary number of binary covariates. If we have $p$ binary covariates, then the CEF will be linear with all two-way interactions, $x_1x_2$, all three-way interactions, $x_1x_2x_3$, up to the $p$-way interaction $x_1\\times\\cdots\\times x_p$. Furthermore, we can generalize to arbitrary numbers of categorical variables by expanding each into a series of binary variables and then including all interactions between the resulting binary variables. \n\n\nWe have established that when we have a set of categorical covariates, the true CEF will be linear, and we have seen the various ways to represent that CEF. Notice that when we use, for example, ordinary least squares, we are free to choose how to include our variables. That means that we could run a regression of $Y_i$ on $X_{i1}$ and $X_{i2}$ without an interaction term. This model will only be correct if $\\beta_3$ is equal to 0, and so the interaction term is irrelevant. Because of this ability to choose our models, it's helpful to have a language for models that capture the linear CEF appropriately. We call a model **saturated** if there are as many coefficients as the CEF's unique values. A saturated model, by its nature, can always be written as a linear function without assumptions. The above examples show how to construct saturated models in various situations.\n\n## Interpretation of the regression coefficients\n\nWe have seen how to interpret population regression coefficients when the CEF is linear without assumptions. 
How do we interpret the population coefficients $\\bfbeta$ in other settings? \n\n\nLet's start with the simplest case, where every entry in $\\X_{i}$ represents a different covariate and no covariate is any function of another (we'll see why this caveat is necessary below). In this simple case, the $k$th coefficient, $\\beta_{k}$, will represent the change in the predicted outcome for a one-unit change in the $k$th covariate $X_{ik}$, holding all other covariates fixed. We can see this from \n$$ \n\\begin{aligned}\n m(x_{1} + 1, x_{2}) & = \\beta_{0} + \\beta_{1}(x_{1} + 1) + \\beta_{2}x_{2} \\\\\n m(x_{1}, x_{2}) &= \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{2},\n\\end{aligned} \n$$\nso that the change in the predicted outcome for increasing $X_{i1}$ by one unit is\n$$\n m(x_{1} + 1, x_{2}) - m(x_{1}, x_{2}) = \\beta_1\n$$\nNotice that nothing changes in this interpretation if we add more covariates to the vector,\n$$\n m(x_{1} + 1, \\bfx_{2}) - m(x_{1}, \\bfx_{2}) = \\beta_1,\n$$\nthe coefficient on a particular variable is the change in the predicted outcome for a one-unit change in the covariate holding all other covariates constant. Each coefficient summarizes the \"all else equal\" difference in the predicted outcome for each covariate. \n\n\n### Polynomial functions of the covariates\n\n\n\nThe interpretation of the population regression coefficients becomes more complicated when we include nonlinear functions of the covariates. In that case, multiple coefficients control how a change in a covariate will change the predicted value of $Y_i$. Suppose that we have a quadratic function of $X_{i1}$,\n$$ \nm(x_1, x_1^2, x_{2}) = \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{1}^{2} + \\beta_{3}x_{2},\n$$\nand try to look at a one-unit change in $x_1$,\n$$ \n\\begin{aligned}\n m(x_{1} + 1, (x_{1} + 1)^{2}, x_{2}) & = \\beta_{0} + \\beta_{1}(x_{1} + 1) + \\beta_{2}(x_{1} + 1)^{2}+ \\beta_{3}x_{2} \\\\\n m(x_{1}, x_{1}^{2}, x_{2}) &= \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{1}^{2} + \\beta_{3}x_{2},\n\\end{aligned} \n$$\nresulting in $\\beta_1 + \\beta_2(2x_{1} + 1)$. This formula might be an interesting quantity, but we will more commonly use the derivative of $m(\\bfx)$ with respect to $x_1$ as a measure of the marginal effect of $X_{i1}$ on the predicted value of $Y_i$ (holding all other variables constant), where \"marginal\" here means the change in prediction for a very small change in $X_{i1}$.[^effect] In the case of the quadratic covariate, we have\n$$ \n\\frac{\\partial m(x_{1}, x_{1}^{2}, x_{2})}{\\partial x_{1}} = \\beta_{1} + 2\\beta_{2}x_{1},\n$$\nso the marginal effect on prediction varies as a function of $x_1$. From this, we can see that the individual interpretations of the coefficients are less interesting: $\\beta_1$ is the marginal effect when $X_{i1} = 0$ and $\\beta_2 / 2$ describes how a one-unit change in $X_{i1}$ changes the marginal effect. As is hopefully clear, it will often be more straightforward to visualize the nonlinear predictor function (perhaps using the orthogonalization techniques in @sec-fwl). \n\n\n[^effect]: Notice the choice of language here. The marginal effect is on the predicted value of $Y_i$, not on $Y_i$ itself. So these marginal effects are associational, not necessarily causal quantities. 
\n\n### Interactions\n\nAnother common nonlinear function of the covariates is when we include **interaction terms** or covariates that are products of two other covariates,\n$$ \nm(x_{1}, x_{2}, x_{1}x_{2}) = \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{2} + \\beta_{3}x_{1}x_{2}.\n$$\nIn these situations, we can also use the derivative of the BLP to measure the marginal effect of one variable or the other on the predicted value of $Y_i$. In particular, we have\n$$ \n\\begin{aligned}\n \\frac{\\partial m(x_{1}, x_{2}, x_{1}x_{2})}{\\partial x_1} &= \\beta_1 + \\beta_3x_2, \\\\\n \\frac{\\partial m(x_{1}, x_{2}, x_{1}x_{2})}{\\partial x_2} &= \\beta_2 + \\beta_3x_1.\n\\end{aligned}\n$$\nHere, the coefficients are slightly more interpretable:\n\n* $\\beta_1$: the marginal effect of $X_{i1}$ on predicted $Y_i$ when $X_{i2} = 0$.\n* $\\beta_2$: the marginal effect of $X_{i2}$ on predicted $Y_i$ when $X_{i1} = 0$.\n* $\\beta_3$: the change in the marginal effect of $X_{i1}$ due to a one-unit change in $X_{i2}$ **OR** the change in the marginal effect of $X_{i2}$ due to a one-unit change in $X_{i2}$.\n\nIf we add more covariates to this BLP, these interpretations change to \"holding all other covariates constant.\"\n\nInteractions are a routine part of social science research because they allow us to assess how the relationship between the outcome and an independent variable varies by the values of another variable. In the context of our study of voter wait times, if $X_{i1}$ is income and $X_{i2}$ is the Black/non-Black voter indicator, then $\\beta_3$ represents the change in the slope of the wait time-income relationship between Black and non-Black voters. \n\n\n## Multiple regression from bivariate regression {#sec-fwl}\n\nWhen we have a regression of an outcome on two covariates, it is helpful to understand how the coefficients of one variable relate to the other. For example, if we have the following best linear projection:\n$$ \n(\\alpha, \\beta, \\gamma) = \\argmin_{(a,b,c) \\in \\mathbb{R}^{3}} \\; \\E[(Y_{i} - (a + bX_{i} + cZ_{i}))^{2}]\n$$ {#eq-two-var-blp}\nIs there some way to understand the $\\beta$ coefficient here regarding simple linear regression? As it turns out, yes. From the above results, we know that the intercept has a simple form:\n$$\n\\alpha = \\E[Y_i] - \\beta\\E[X_i] - \\gamma\\E[Z_i].\n$$\nLet's investigate the first order condition for $\\beta$:\n$$ \n\\begin{aligned}\n 0 &= \\E[Y_{i}X_{i}] - \\alpha\\E[X_{i}] - \\beta\\E[X_{i}^{2}] - \\gamma\\E[X_{i}Z_{i}] \\\\\n &= \\E[Y_{i}X_{i}] - \\E[Y_{i}]\\E[X_{i}] + \\beta\\E[X_{i}]^{2} + \\gamma\\E[X_{i}]\\E[Z_{i}] - \\beta\\E[X_{i}^{2}] - \\gamma\\E[X_{i}Z_{i}] \\\\\n &= \\cov(Y, X) - \\beta\\V[X_{i}] - \\gamma \\cov(X_{i}, Z_{i})\n\\end{aligned}\n$$\nWe can see from this that if $\\cov(X_{i}, Z_{i}) = 0$, then the coefficient on $X_i$ will be the same as in the simple regression case, $\\cov(Y_{i}, X_{i})/\\V[X_{i}]$. When $X_i$ and $Z_i$ are uncorrelated, we sometimes call them **orthogonal**. 
\n\nTo write a simple formula for $\\beta$ when the covariates are not orthogonal, we will **orthogonalize** $X_i$ by obtaining the prediction errors from a population linear regression of $X_i$ on $Z_i$:\n$$ \n\\widetilde{X}_{i} = X_{i} - (\\delta_{0} + \\delta_{1}Z_{i}) \\quad\\text{where}\\quad (\\delta_{0}, \\delta_{1}) = \\argmin_{(d_{0},d_{1}) \\in \\mathbb{R}^{2}} \\; \\E[(X_{i} - (d_{0} + d_{1}Z_{i}))^{2}]\n$$\nGiven the properties of projection errors, we know that this orthogonalized version of $X_{i}$ will be uncorrelated with $Z_{i}$ since $\\E[\\widetilde{X}_{i}Z_{i}] = 0$. Remarkably, the coefficient on $X_i$ from the \"long\" BLP in @eq-two-var-blp is the same as the regression of $Y_i$ on this orthogonalized $\\widetilde{X}_i$, \n$$ \n\\beta = \\frac{\\text{cov}(Y_{i}, \\widetilde{X}_{i})}{\\V[\\widetilde{X}_{i}]}\n$$\n\nWe can expand this idea to when there are several other covariates. Suppose now that we are interested in a regression of $Y_i$ on $\\X_i$ and we are interested in the coefficient on the $k$th covariate. Let $\\X_{i,-k}$ be the vector of covariates omitting the $k$th entry and let $m_k(\\X_{i,-k})$ represent the BLP of $X_{ik}$ on these other covariates. We can define $\\widetilde{X}_{ik} = X_{ik} - m_{k}(\\X_{i,-k})$ as the $k$th variable orthogonalized with respect to the rest of the variables and we can write the coefficient on $X_{ik}$ as\n$$ \n\\beta_k = \\frac{\\cov(Y_i, \\widetilde{X}_{ik})}{\\V[\\widetilde{X}_{ik}]}.\n$$ \nThus, the population regression coefficient in the BLP is the same as from a bivariate regression of the outcome on the projection error for $X_{ik}$ projected on all other covariates. One interpretation of coefficients in a population multiple regression is they represent the relationship between the outcome and the covariate after removing the linear relationships of all other variables. \n\n\n## Omitted variable bias\n\nIn many situations, we might need to choose whether to include a variable in a regression or not, so it can be helpful to understand how this choice might affect the population coefficients on the other variables in the regression. Suppose we have a variable $Z_i$ that we may add to our regression which currently has $\\X_i$ as the covariates. We can write this new projection as \n$$ \nm(\\X_i, Z_i) = \\X_i'\\bfbeta + Z_i\\gamma, \\qquad m(\\X_{i}) = \\X_i'\\bs{\\delta},\n$$\nwhere we often refer to $m(\\X_i, Z_i)$ as the long regression and $m(\\X_i)$ as the short regression. \n\nWe know from the definition of the BLP that we can write the short coefficients as \n$$ \n\\bs{\\delta} = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1} \\E[\\X_{i}Y_{i}].\n$$\nLetting $e_i = Y_i - m(\\X_{i}, Z_{i})$ be the projection errors from the long regression, we can write this as\n$$ \n\\begin{aligned}\n \\bs{\\delta} &= \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1} \\E[\\X_{i}(\\X_{i}'\\bfbeta + Z_{i}\\gamma + e_{i})] \\\\\n &= \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}(\\E[\\X_{i}\\X_{i}']\\bfbeta + \\E[\\X_{i}Z_{i}]\\gamma + \\E[\\X_{i}e_{i}]) \\\\\n &= \\bfbeta + \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Z_{i}]\\gamma\n\\end{aligned}\n$$\nNotice that the vector in the second term is the linear projection coefficients of a population linear regression of $Z_i$ on the $\\X_i$. If we call these coefficients $\\bs{\\pi}$, then the short coefficients are \n$$ \n\\bs{\\delta} = \\bfbeta + \\bs{\\pi}\\gamma. 
\n$$\n\nWe can rewrite this to show that the difference between the coefficients in these two projections is $\\bs{\\delta} - \\bfbeta= \\bs{\\pi}\\gamma$ or the product of the coefficient on the \"excluded\" $Z_i$ and the coefficient of the included $\\X_i$ on the excluded. Most textbooks refer to this difference as the **omitted variable bias** of omitting $Z_i$ under the idea that $\\bfbeta$ is the true target of inference. But the result is much broader than this since it just tells us how to relate the coefficients of two nested projections. \n\n\nThe last two results (multiple regressions from bivariate and omitted variable bias) are sometimes presented as results for the ordinary least squares estimator that we will show in the next chapter. We introduce them here as features of a particular population quantity, the linear projection or population linear regression. \n\n\n## Drawbacks of the BLP\n\nThe best linear predictor is, of course, a *linear* approximation to the CEF, and this approximation could be quite poor if the true CEF is highly nonlinear. A more subtle issue with the BLP is that it is sensitive to the marginal distribution of the covariates when the CEF is nonlinear. Let's return to our example of voter wait times and income. In @fig-blp-limits, we show the true CEF and the BLP when we restrict income below \\$50,000 or above \\$100,000. The BLP can vary quite dramatically here. This figure is an extreme example, but the essential point will still hold as the marginal distribution of $X_i$ changes.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Linear projections for when truncating income distribution below $50k and above $100k.](06_linear_model_files/figure-html/fig-blp-limits-1.png){#fig-blp-limits width=672}\n:::\n:::\n", + "markdown": "# Linear regression {#sec-regression}\n\n\nRegression is a tool for assessing the relationship between an **outcome variable**, $Y_i$, and a set of **covariates**, $\\X_i$. In particular, these tools show how the conditional mean of $Y_i$ varies as a function of $\\X_i$. For example, we may want to know how voting poll wait times vary as a function of some socioeconomic features of the precinct, like income and racial composition. We usually accomplish this task by estimating the **regression function** or **conditional expectation function** (CEF) of the outcome given the covariates, \n$$\n\\mu(\\bfx) = \\E[Y_i \\mid \\X_i = \\bfx].\n$$\nWhy are estimation and inference for this regression function special? Why can't we just use the approaches we have seen for the mean, variance, covariance, and so on? The fundamental problem with the CEF is that there may be many, many values $\\bfx$ that can occur and many different conditional expectations that we will need to estimate. If any variable in $\\X_i$ is continuous, we must estimate an infinite number of possible values of $\\mu(\\bfx)$. Because it worsens as we add covariates to $\\X_i$, we refer to this problem as the **curse of dimensionality**. How can we resolve this with our measly finite data?\n\nIn this chapter, we will explore two ways of \"solving\" the curse of dimensionality: assuming it away and changing the quantity of interest to something easier to estimate. \n\n\nRegression is so ubiquitous in many scientific fields that it has a lot of acquired notational baggage. 
In particular, the labels of the $Y_i$ and $\\X_i$ vary greatly:\n\n- The outcome can also be called: the response variable, the dependent variable, the labels (in machine learning), the left-hand side variable, or the regressand. \n- The covariates are also called: the explanatory variables, the independent variables, the predictors, the regressors, inputs, or features. \n\n\n## Why do we need models?\n\nAt first glance, the connection between the CEF and parametric models might be hazy. For example, imagine we are interested in estimating the average poll wait times ($Y_i$) for Black voters ($X_i = 1$) versus non-Black voters ($X_i=0$). In that case, there are two parameters to estimate, \n$$\n\\mu(1) = \\E[Y_i \\mid X_i = 1] \\quad \\text{and}\\quad \\mu(0) = \\E[Y_i \\mid X_i = 0],\n$$\nwhich we could estimate by using the plug-in estimators that replace the population averages with their sample counterparts,\n$$ \n\\widehat{\\mu}(1) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 1)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 1)} \\qquad \\widehat{\\mu}(0) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 0)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 0)}.\n$$\nThese are just the sample averages of the wait times for Black and non-Black voters, respectively. And because the race variable here is discrete, we are simply estimating sample means within subpopulations defined by race. The same logic would apply if we had $k$ racial categories: we would have $k$ conditional expectations to estimate and $k$ (conditional) sample means. \n\nNow imagine that we want to know how the average poll wait time varies as a function of income so that $X_i$ is (essentially) continuous. Now we have a different conditional expectation for every possible dollar amount from 0 to Bill Gates's income. Imagine we pick a particular income, \\$42,238, and so we are interested in the conditional expectation $\\mu(42,238)= \\E[Y_{i}\\mid X_{i} = 42,238]$. We could use the same plug-in estimator in the discrete case, \n$$\n\\widehat{\\mu}(42,238) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 42,238)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 42,238)}.\n$$\nWhat is the problem with this estimator? In all likelihood, no units in any particular dataset have that exact income, meaning this estimator is undefined (we would be dividing by zero). \n\n\nOne solution to this problem is to use **subclassification**, turn the continuous variable into a discrete one, and proceed with the discrete approach above. We might group incomes into \\$25,000 bins and then calculate the average wait times of anyone between, say, \\$25,000 and \\$50,000 income. When we make this estimator switch for practical purposes, we need to connect it back to the DGP of interest. We could **assume** that the CEF of interest only depends on these binned means, which would mean we have: \n$$\n\\mu(x) = \n\\begin{cases}\n \\E[Y_{i} \\mid 0 \\leq X_{i} < 25,000] &\\text{if } 0 \\leq x < 25,000 \\\\\n \\E[Y_{i} \\mid 25,000 \\leq X_{i} < 50,000] &\\text{if } 25,000 \\leq x < 50,000\\\\\n \\E[Y_{i} \\mid 50,000 \\leq X_{i} < 100,000] &\\text{if } 50,000 \\leq x < 100,000\\\\\n \\vdots \\\\\n \\E[Y_{i} \\mid 200,000 \\leq X_{i}] &\\text{if } 200,000 \\leq x\\\\\n\\end{cases}\n$$\nThis approach assumes, perhaps incorrectly, that the average wait time does not vary within the bins. @fig-cef-binned shows a hypothetical joint distribution between income and wait times with the true CEF, $\\mu(x)$, shown in red. 
The figure also shows the bins created by subclassification and the implied CEF if we assume bin-constant means in blue. We can see that the blue function approximates the true CEF but deviates from it close to the bin edges. The trade-off is that once we make the assumption, we only have to estimate one mean for every bin rather than an infinite number of means for each possible income. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of income and poll wait times (contour plot), conditional expectation function (red), and the conditional expectation of the binned income (blue).](06_linear_model_files/figure-html/fig-cef-binned-1.png){#fig-cef-binned width=672}\n:::\n:::\n\n\nSimilarly, we could **assume** that the CEF follows a simple functional form like a line,\n$$ \n\\mu(x) = \\E[Y_{i}\\mid X_{i} = x] = \\beta_{0} + \\beta_{1} x.\n$$\nThis assumption reduces our infinite number of unknowns (the conditional mean at every possible income) to just two unknowns: the slope and intercept. As we will see, we can use the standard ordinary least squares to estimate these parameters. Notice again that if the true CEF is nonlinear, this assumption is incorrect, and any estimate based on this assumption might be biased or even inconsistent. \n\nWe call the binning and linear assumptions on $\\mu(x)$ **functional form** assumptions because they restrict the class of functions that $\\mu(x)$ can take. While powerful, these types of assumptions can muddy the roles of defining the quantity of interest and estimation. If our estimator $\\widehat{\\mu}(x)$ performs poorly, it will be difficult to tell if this is because the estimator is flawed or our functional form assumptions are incorrect. \n\nTo help clarify these issues, we will pursue a different approach: understanding what linear regression can estimate under minimal assumptions and then investigating how well this estimand approximates the true CEF. \n\n## Population linear regression {#sec-linear-projection}\n\n### Bivariate linear regression \n\n\nLet's set aside the idea of the conditional expectation function and instead focus on finding the **linear** function of a single covariate $X_i$ that best predicts the outcome. Remember that linear functions have the form $a + bX_i$. The **best linear predictor** (BLP) or **population linear regression** of $Y_i$ on $X_i$ is defined as\n$$ \nm(x) = \\beta_0 + \\beta_1 x \\quad\\text{where, }\\quad (\\beta_{0}, \\beta_{1}) = \\argmin_{(b_{0}, b_{1}) \\in \\mathbb{R}^{2}}\\; \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}].\n$$\nThat is, the best linear predictor is the line that results in the lowest mean-squared error predictions of the outcome given the covariates, averaging over the joint distribution of the data. This function is a feature of the joint distribution of the data---the DGP---and so represents something that we would like to learn about with our sample. It is an alternative to the CEF for summarizing the relationship between the outcome and the covariate, though we will see that they will sometimes be equal. We call $(\\beta_{0}, \\beta_{1})$ the **population linear regression coefficients**. Notice that $m(x)$ could differ greatly from the CEF $\\mu(x)$ if the latter is nonlinear. \n\nWe can solve for the best linear predictor using standard calculus (taking the derivative with respect to each coefficient, setting those equations equal to 0, and solving the system of equations). 
The first-order conditions, in this case, are\n$$ \n\\begin{aligned}\n \\frac{\\partial \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}]}{\\partial b_{0}} = \\E[-2(Y_{i} - \\beta_{0} - \\beta_{1}X_{i})] = 0 \\\\\n \\frac{\\partial \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}]}{\\partial b_{1}} = \\E[-2(Y_{i} - \\beta_{0} - \\beta_{1}X_{i})X_{i}] = 0\n\\end{aligned} \n$$\nGiven the linearity of expectations, it is easy to solve for $\\beta_0$ in terms of $\\beta_1$,\n$$ \n\\beta_{0} = \\E[Y_{i}] - \\beta_{1}\\E[X_{i}].\n$$\nWe can plug this into the first-order condition for $\\beta_1$ to get\n$$ \n\\begin{aligned}\n 0 &= \\E[Y_{i}X_{i}] - (\\E[Y_{i}] - \\beta_{1}\\E[X_{i}])\\E[X_{i}] - \\beta_{1}\\E[X_{i}^{2}] \\\\\n &= \\E[Y_{i}X_{i}] - \\E[Y_{i}]\\E[X_{i}] - \\beta_{1}(\\E[X_{i}^{2}] - \\E[X_{i}]^{2}) \\\\\n &= \\cov(X_{i},Y_{i}) - \\beta_{1}\\V[X_{i}]\\\\\n \\beta_{1} &= \\frac{\\cov(X_{i},Y_{i})}{\\V[X_{i}]}\n\\end{aligned}\n$$\n\nThus the slope on the population linear regression of $Y_i$ on $X_i$ is equal to the ratio of the covariance of the two variables divided by the variance of $X_i$. From this, we can immediately see that the covariance will determine the sign of the slope: positive covariances will lead to positive $\\beta_1$ and negative covariances will lead to negative $\\beta_1$. In addition, we can see that if $Y_i$ and $X_i$ are independent, $\\beta_1 = 0$. The slope scales this covariance by the variance of the covariate, so slopes are lower for more spread-out covariates and higher for more spread-out covariates. If we define the correlation between these variables as $\\rho_{YX}$, then we can relate the coefficient to this quantity as \n$$\n\\beta_1 = \\rho_{YX}\\sqrt{\\frac{\\V[Y_i]}{\\V[X_i]}}.\n$$\n\nCollecting together our results, we can write the population linear regression as \n$$\nm(x) = \\beta_0 + \\beta_1x = \\E[Y_i] + \\beta_1(x - \\E[X_i]),\n$$\nwhich shows how we adjust our best guess about $Y_i$ from the mean of the outcome using the covariate. \n\nIt's important to remember that the BLP, $m(x)$, and the CEF, $\\mu(x)$, are distinct entities. If the CEF is nonlinear, as in @fig-cef-blp, there will be a difference between these functions, meaning that the BLP might produce subpar predictions. Below, we will derive a formal connection between the BLP and the CEF. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Comparison of the CEF and the best linear predictor.](06_linear_model_files/figure-html/fig-cef-blp-1.png){#fig-cef-blp width=672}\n:::\n:::\n\n\n\n### Beyond linear approximations\n\nThe linear part of the best linear predictor is less restrictive than at first glance. We can easily modify the minimum MSE problem to find the best quadratic, cubic, or general polynomial function of $X_i$ that predicts $Y_i$. For example, the quadratic function of $X_i$ that best predicts $Y_i$ would be\n$$ \nm(X_i, X_i^2) = \\beta_0 + \\beta_1X_i + \\beta_2X_i^2 \\quad\\text{where}\\quad \\argmin_{(b_0,b_1,b_2) \\in \\mathbb{R}^3}\\;\\E[(Y_{i} - b_{0} - b_{1}X_{i} - b_{2}X_{i}^{2})^{2}].\n$$\nThis equation is now a quadratic function of the covariates, but it is still a linear function of the unknown parameters $(\\beta_{0}, \\beta_{1}, \\beta_{2})$ so we will call this a best linear predictor. \n\nWe could include higher order terms of $X_i$ in the same manner, and as we include more polynomial terms, $X_i^p$, the more flexible the function of $X_i$ we will capture with the BLP. 
When we estimate the BLP, however, we usually will pay for this flexibility in terms of overfitting and high variance in our estimates. \n\n\n### Linear prediction with multiple covariates \n\nWe now generalize the idea of a best linear predictor to a setting with an arbitrary number of covariates. In this setting, remember that the linear function will be\n\n$$ \n\\bfx'\\bfbeta = x_{1}\\beta_{1} + x_{2}\\beta_{2} + \\cdots + x_{k}\\beta_{k}.\n$$\nWe will define the **best linear predictor** (BLP) to be\n$$ \nm(\\bfx) = \\bfx'\\bfbeta, \\quad \\text{where}\\quad \\bfbeta = \\argmin_{\\mb{b} \\in \\real^k}\\; \\E\\bigl[ \\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\\bigr]\n$$\n\nThis BLP solves the same fundamental optimization problem as in the bivariate case: it chooses the set of coefficients that minimizes the mean-squared error averaging over the joint distribution of the data. \n\n\n\n::: {.callout-note}\n## Best linear projection assumptions\n\nWithout some assumptions on the joint distribution of the data, the following \"regularity conditions\" will ensure the existence of the BLP:\n\n1. $\\E[Y^2] < \\infty$ (outcome has finite mean/variance)\n2. $\\E\\Vert \\mb{X} \\Vert^2 < \\infty$ ($\\mb{X}$ has finite means/variances/covariances)\n3. $\\mb{Q}_{\\mb{XX}} = \\E[\\mb{XX}']$ is positive definite (columns of $\\X$ are linearly independent) \n:::\n\nUnder these assumptions, it is possible to derive a closed-form expression for the **population coefficients** $\\bfbeta$ using matrix calculus. To set up the optimization problem, we will find the first-order condition by taking the derivative of the expectation of the squared errors. First, let's take the derivative of the squared prediction errors using the chain rule:\n$$ \n\\begin{aligned}\n \\frac{\\partial}{\\partial \\mb{b}}\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)^{2}\n &= 2\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)\\frac{\\partial}{\\partial \\mb{b}}(Y_{i} - \\X_{i}'\\mb{b}) \\\\\n &= -2\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)\\X_{i} \\\\\n &= -2\\X_{i}\\left(Y_{i} - \\X_{i}'\\mb{b}\\right) \\\\\n &= -2\\left(\\X_{i}Y_{i} - \\X_{i}\\X_{i}'\\mb{b}\\right),\n\\end{aligned}\n$$\nwhere the third equality comes from the fact that $(Y_{i} - \\X_{i}'\\bfbeta)$ is a scalar. We can now plug this into the expectation to get the first-order condition and solve for $\\bfbeta$,\n$$ \n\\begin{aligned}\n 0 &= -2\\E[\\X_{i}Y_{i} - \\X_{i}\\X_{i}'\\bfbeta ] \\\\\n \\E[\\X_{i}\\X_{i}'] \\bfbeta &= \\E[\\X_{i}Y_{i}],\n\\end{aligned}\n$$\nwhich implies the population coefficients are\n$$ \n\\bfbeta = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] = \\mb{Q}_{\\mb{XX}}^{-1}\\mb{Q}_{\\mb{X}Y}\n$$\nWe now have an expression for the coefficients for the population best linear predictor in terms of the joint distribution $(Y_{i}, \\X_{i})$. A couple of facts might be useful for reasoning this expression. Recall that $\\mb{Q}_{\\mb{XX}} = \\E[\\X_{i}\\X_{i}']$ is a $k\\times k$ matrix and $\\mb{Q}_{\\X Y} = \\E[\\X_{i}Y_{i}]$ is a $k\\times 1$ column vector, which implies that $\\bfbeta$ is also a $k \\times 1$ column vector. \n\n::: {.callout-note}\n\nIntuitively, what is happening in the expression for the population regression coefficients? It is helpful to separate the intercept or constant term so that we have\n$$ \nY_{i} = \\beta_{0} + \\X'\\bfbeta + e_{i},\n$$\nso $\\bfbeta$ refers to just the vector of coefficients for the covariates. 
In this case, we can write the coefficients in a more interpretable way:\n$$ \n\\bfbeta = \\V[\\X]^{-1}\\text{Cov}(\\X, Y), \\qquad \\beta_0 = \\mu_Y - \\mb{\\mu}'_{\\mb{X}}\\bfbeta\n$$\n\nThus, the population coefficients take the covariance between the outcome and the covariates and \"divide\" it by information about variances and covariances of the covariates. The intercept recenters the regression so that projection errors are mean zero. Thus, we can see that these coefficients generalize the bivariate formula to this multiple covariate context. \n:::\n\nWith an expression for the population linear regression coefficients, we can write the linear projection as \n$$ \nm(\\X_{i}) = \\X_{i}'\\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] = \\X_{i}'\\mb{Q}_{\\mb{XX}}^{-1}\\mb{Q}_{\\mb{X}Y}\n$$\n\n\n\n### Projection error\n\nThe **projection error** is the difference between the actual value of $Y_i$ and the projection,\n$$ \ne_{i} = Y_{i} - m(\\X_{i}) = Y_i - \\X_{i}'\\bfbeta,\n$$\nwhere we have made no assumptions about this error yet. The projection error is simply the prediction error of the best linear prediction. Rewriting this definition, we can see that we can always write the outcome as the linear projection plus the projection error,\n$$ \nY_{i} = \\X_{i}'\\bfbeta + e_{i}.\n$$\nNotice that this looks suspiciously similar to a linearity assumption on the CEF, but we haven't made any assumptions here. Instead, we have just used the definition of the projection error to write a tautological statement: \n$$ \nY_{i} = \\X_{i}'\\bfbeta + e_{i} = \\X_{i}'\\bfbeta + Y_{i} - \\X_{i}'\\bfbeta = Y_{i}.\n$$\nThe critical difference between this representation and the usual linear model assumption is what properties $e_{i}$ possesses. \n\nOne key property of the projection errors is that when the covariate vector includes an \"intercept\" or constant term, the projection errors are uncorrelated with the covariates. To see this, we first note that $\\E[\\X_{i}e_{i}] = 0$ since\n$$ \n\\begin{aligned}\n \\E[\\X_{i}e_{i}] &= \\E[\\X_{{i}}(Y_{i} - \\X_{i}'\\bfbeta)] \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}\\X_{i}']\\bfbeta \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}\\X_{i}']\\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}Y_{i}] = 0\n\\end{aligned}\n$$\nThus, for every $X_{ij}$ in $\\X_{i}$, we have $\\E[X_{ij}e_{i}] = 0$. If one of the entries in $\\X_i$ is a constant 1, then this also implies that $\\E[e_{i}] = 0$. Together, these facts imply that the projection error is uncorrelated with each $X_{ij}$, since\n$$ \n\\cov(X_{ij}, e_{i}) = \\E[X_{ij}e_{i}] - \\E[X_{ij}]\\E[e_{i}] = 0 - 0 = 0\n$$\nNotice that we still have made no assumptions about these projection errors except for some mild regularity conditions on the joint distribution of the outcome and covariates. Thus, in very general settings, we can write the linear projection model $Y_i = \\X_i'\\bfbeta + e_i$ where $\\bfbeta = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}]$ and conclude that $\\E[\\X_{i}e_{i}] = 0$ by definition, not by assumption. \n\nThe projection error is uncorrelated with the covariates, so does this mean that the CEF is linear? Unfortunately, no. Recall that while independence implies uncorrelated, the reverse does not hold. 
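
As a minimal illustration of this distinction (a Python sketch with simulated data, not part of the text's running example), the projection errors below are uncorrelated with the covariate by construction, yet their conditional mean still varies with the covariate because the true CEF is nonlinear.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Nonlinear CEF: E[Y | X] = X^2, so the BLP is only an approximation
X = rng.normal(size=n)
Y = X**2 + rng.normal(size=n)

# Best linear predictor of Y given (1, X) and its projection errors
beta1 = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
beta0 = Y.mean() - beta1 * X.mean()
e = Y - (beta0 + beta1 * X)

print(e.mean(), (X * e).mean())  # both approximately 0: E[e] = 0 and E[X e] = 0

# But the conditional mean of the errors still depends on X
for lo, hi in [(-3, -1), (-1, 1), (1, 3)]:
    mask = (X >= lo) & (X < hi)
    print(lo, hi, e[mask].mean())  # clearly nonzero and varying across bins
```

In the simulation, the errors are uncorrelated with $X_i$, but they are not mean independent of it.
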
So when we look at the CEF, we have\n$$ \n\\E[Y_{i} \\mid \\X_{i}] = \\X_{i}'\\bfbeta + \\E[e_{i} \\mid \\X_{i}],\n$$\nand the last term $\\E[e_{i} \\mid \\X_{i}]$ would only be 0 if the errors were independent of the covariates, so $\\E[e_{i} \\mid \\X_{i}] = \\E[e_{i}] = 0$. But nowhere in the linear projection model did we assume this. So while we can (almost) always write the outcome as $Y_i = \\X_i'\\bfbeta + e_i$ and have those projection errors be uncorrelated with the covariates, it will require additional assumptions to ensure that the true CEF is, in fact, linear $\\E[Y_{i} \\mid \\X_{i}] = \\X_{i}'\\bfbeta$. \n\nLet's take a step back. What have we shown here? In a nutshell, we have shown that a population linear regression exists under very general conditions, and we can write the coefficients of that population linear regression as a function of expectations of the joint distribution of the data. We did not assume that the CEF was linear nor that the projection errors were normal. \n\n\nWhy do we care about this? The ordinary least squares estimator, the workhorse regression estimator, targets this quantity of interest in large samples, regardless of whether the true CEF is linear or not. Thus, even when a linear CEF assumption is incorrect, OLS still targets a perfectly valid quantity of interest: the coefficients from this population linear projection. \n\n## Linear CEFs without assumptions\n\nWhat is the relationship between the best linear predictor (which we just saw generally exists) and the CEF? To draw the connection, remember a vital property of the conditional expectation: it is the function of $\\X_i$ that best predicts $Y_{i}$. The population regression was the best **linear** predictor, but the CEF is the best predictor among all nicely behaved functions of $\\X_{i}$, linear or nonlinear. In particular, if we label $L_2$ to be the set of all functions of the covariates $g()$ that have finite squared expectation, $\\E[g(\\X_{i})^{2}] < \\infty$, then we can show that the CEF has the lowest squared prediction error in this class of functions:\n$$ \n\\mu(\\X) = \\E[Y_{i} \\mid \\X_{i}] = \\argmin_{g(\\X_i) \\in L_2}\\; \\E\\left[(Y_{i} - g(\\X_{i}))^{2}\\right],\n$$\n\nSo we have established that the CEF is the best predictor and the population linear regression $m(\\X_{i})$ is the best linear predictor. These two facts allow us to connect the CEF and the population regression.\n\n::: {#thm-cef-blp}\nIf $\\mu(\\X_{i})$ is a linear function of $\\X_i$, then $\\mu(\\X_{i}) = m(\\X_{i}) = \\X_i'\\bfbeta$. \n\n:::\n\nThis theorem says that if the true CEF is linear, it equals the population linear regression. The proof of this is straightforward: the CEF is the best predictor, so if it is linear, it must also be the best linear predictor. \n \n \nIn general, we are in the business of learning about the CEF, so we are unlikely to know if it genuinely is linear or not. In some situations, however, we can show that the CEF is linear without any additional assumptions. These will be situations when the covariates take on a finite number of possible values. Suppose we are interested in the CEF of poll wait times for Black ($X_i = 1$) vs. non-Black ($X_i = 0$) voters. In this case, there are two possible values of the CEF, $\\mu(1) = \\E[Y_{i}\\mid X_{i}= 1]$, the average wait time for Black voters, and $\\mu(0) = \\E[Y_{i}\\mid X_{i} = 0]$, the average wait time for non-Black voters. 
Notice that we can write the CEF as\n$$ \n\\mu(x) = x \\mu(1) + (1 - x) \\mu(0) = \\mu(0) + x\\left(\\mu(1) - \\mu(0)\\right)= \\beta_0 + x\\beta_1,\n$$\nwhich is clearly a linear function of $x$. Based on this derivation, we can see that the coefficients of this linear CEF have a clear interpretation:\n\n- $\\beta_0 = \\mu(0)$: the expected wait time for a non-Black voter. \n- $\\beta_1 = \\mu(1) - \\mu(0)$: the difference in average wait times between Black and non-Black voters. \n\nNotice that it matters how $X_{i}$ is defined here since the intercept will always be the average outcome when $X_i = 0$, and the slope will always be the difference in means between the $X_i = 1$ group and the $X_i = 0$ group. \n\nWhat about a categorical covariate with more than two levels? For instance, we might be interested in wait times by party identification, where $X_i = 1$ indicates Democratic voters, $X_i = 2$ indicates Republican voters, and $X_i = 3$ indicates independent voters. How can we write the CEF of wait times as a linear function of this variable? Using the numeric coding of $X_i$ directly would assume that the difference between Democrats and Republicans is the same as the difference between Independents and Republicans. With more than two levels, we can represent a categorical variable as a vector of binary variables, $\\X_i = (X_{i1}, X_{i2})$, where\n$$ \n\\begin{aligned}\n X_{{i1}} &= \\begin{cases}\n 1&\\text{if Republican} \\\\\n 0 & \\text{if not Republican}\n \\end{cases} \\\\\nX_{{i2}} &= \\begin{cases}\n 1&\\text{if independent} \\\\\n 0 & \\text{if not independent}\n \\end{cases} \\\\\n\\end{aligned}\n$$\nThese two indicator variables encode the same information as the original three-level variable, $X_{i}$. If we know the values of $X_{i1}$ and $X_{i2}$, we know exactly to which party $i$ belongs. Thus, the CEFs for $X_i$ and the pair of indicator variables, $\\X_i$, are precisely the same, but the latter admits a lovely linear representation,\n$$\n\\E[Y_i \\mid X_{i1}, X_{i2}] = \\beta_0 + \\beta_1 X_{i1} + \\beta_2 X_{i2},\n$$\nwhere\n\n- $\\beta_0 = \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the average wait time for the group who does not get an indicator variable (Democrats in this case). \n- $\\beta_1 = \\E[Y_{i} \\mid X_{i1} = 1, X_{i2} = 0] - \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the difference in means between Republican voters and Democratic voters, or the difference between the first indicator group and the baseline group. \n- $\\beta_2 = \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 1] - \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the difference in means between independent voters and Democratic voters, or the difference between the second indicator group and the baseline group.\n\nThis approach easily generalizes to categorical variables with an arbitrary number of levels. \n\nWhat have we shown? The CEF will be linear without additional assumptions when there is a categorical covariate. We can show that this continues to hold even when we have multiple categorical variables. Suppose now that we have two binary covariates: $X_{i1}=1$ indicating a Black voter, and $X_{i2} = 1$ indicating an urban voter.
With these two binary variables, there are four possible values of the CEF:\n$$ \n\\mu(x_1, x_2) = \\begin{cases} \n \\mu_{00} & \\text{if } x_1 = 0 \\text{ and } x_2 = 0 \\text{ (non-Black, rural)} \\\\\n \\mu_{10} & \\text{if } x_1 = 1 \\text{ and } x_2 = 0 \\text{ (Black, rural)} \\\\\n \\mu_{01} & \\text{if } x_1 = 0 \\text{ and } x_2 = 1 \\text{ (non-Black, urban)} \\\\\n \\mu_{11} & \\text{if } x_1 = 1 \\text{ and } x_2 = 1 \\text{ (Black, urban)}\n \\end{cases}\n$$\nWe can write this as\n$$ \n\\mu(x_{1}, x_{2}) = (1 - x_{1})(1 - x_{2})\\mu_{00} + x_{1}(1 -x_{2})\\mu_{10} + (1-x_{1})x_{2}\\mu_{01} + x_{1}x_{2}\\mu_{11},\n$$\nwhich we can rewrite as \n$$ \n\\mu(x_1, x_2) = \\beta_0 + x_1\\beta_1 + x_2\\beta_2 + x_1x_2\\beta_3,\n$$\nwhere\n\n- $\\beta_0 = \\mu_{00}$: average wait times for rural non-Black voters. \n- $\\beta_1 = \\mu_{10} - \\mu_{00}$: difference in means for rural Black vs. rural non-Black voters. \n- $\\beta_2 = \\mu_{01} - \\mu_{00}$: difference in means for urban non-Black vs. rural non-Black voters. \n- $\\beta_3 = (\\mu_{11} - \\mu_{01}) - (\\mu_{10} - \\mu_{00})$: difference in urban racial difference vs rural racial difference.\n\nThus, we can write the CEF with two binary covariates as linear when the linear specification includes a multiplicative interaction between them ($x_1x_2$). This result holds for all pairs of binary covariates, and we can generalize the interpretation of the coefficients in the CEF as\n\n- $\\beta_0 = \\mu_{00}$: average outcome when both variables are 0. \n- $\\beta_1 = \\mu_{10} - \\mu_{00}$: difference in average outcomes for the first covariate when the second covariate is 0. \n- $\\beta_2 = \\mu_{01} - \\mu_{00}$: difference in average outcomes for the second covariate when the first covariate is 0. \n- $\\beta_3 = (\\mu_{11} - \\mu_{01}) - (\\mu_{10} - \\mu_{00})$: change in the \"effect\" of the first (second) covariate when the second (first) covariate goes from 0 to 1. \n\nThis result also generalizes to an arbitrary number of binary covariates. If we have $p$ binary covariates, then the CEF will be linear with all two-way interactions, $x_1x_2$, all three-way interactions, $x_1x_2x_3$, up to the $p$-way interaction $x_1\\times\\cdots\\times x_p$. Furthermore, we can generalize to arbitrary numbers of categorical variables by expanding each into a series of binary variables and then including all interactions between the resulting binary variables. \n\n\nWe have established that when we have a set of categorical covariates, the true CEF will be linear, and we have seen the various ways to represent that CEF. Notice that when we use, for example, ordinary least squares, we are free to choose how to include our variables. That means that we could run a regression of $Y_i$ on $X_{i1}$ and $X_{i2}$ without an interaction term. This model will only be correct if $\\beta_3$ is equal to 0, and so the interaction term is irrelevant. Because of this ability to choose our models, it's helpful to have a language for models that capture the linear CEF appropriately. We call a model **saturated** if there are as many coefficients as the CEF's unique values. A saturated model, by its nature, can always be written as a linear function without assumptions. The above examples show how to construct saturated models in various situations.\n\n## Interpretation of the regression coefficients\n\nWe have seen how to interpret population regression coefficients when the CEF is linear without assumptions. 
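
As a quick numerical check of the saturated-model result above (a Python sketch with made-up cell means, not the actual wait-time data), regressing a simulated outcome on the two binaries and their interaction recovers exactly the four cell means in the parameterization just described.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical cell means for the four (x1, x2) combinations
mu = {(0, 0): 10.0, (1, 0): 14.0, (0, 1): 9.0, (1, 1): 16.0}

x1 = rng.integers(0, 2, size=n)
x2 = rng.integers(0, 2, size=n)
cef = np.array([mu[(a, b)] for a, b in zip(x1, x2)])
y = cef + rng.normal(size=n)

# Saturated specification: intercept, x1, x2, and their interaction
design = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b = np.linalg.lstsq(design, y, rcond=None)[0]
print(b)  # approximately [10, 4, -1, 3]:
          # b0 = mu_00, b1 = mu_10 - mu_00, b2 = mu_01 - mu_00,
          # b3 = (mu_11 - mu_01) - (mu_10 - mu_00)
```
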
How do we interpret the population coefficients $\\bfbeta$ in other settings? \n\n\nLet's start with the simplest case, where every entry in $\\X_{i}$ represents a different covariate and no covariate is a function of another (we'll see why this caveat is necessary below). In this simple case, the $k$th coefficient, $\\beta_{k}$, will represent the change in the predicted outcome for a one-unit change in the $k$th covariate $X_{ik}$, holding all other covariates fixed. We can see this from \n$$ \n\\begin{aligned}\n m(x_{1} + 1, x_{2}) & = \\beta_{0} + \\beta_{1}(x_{1} + 1) + \\beta_{2}x_{2} \\\\\n m(x_{1}, x_{2}) &= \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{2},\n\\end{aligned} \n$$\nso that the change in the predicted outcome for increasing $X_{i1}$ by one unit is\n$$\n m(x_{1} + 1, x_{2}) - m(x_{1}, x_{2}) = \\beta_1.\n$$\nNotice that nothing changes in this interpretation if we add more covariates to the vector:\n$$\n m(x_{1} + 1, \\bfx_{2}) - m(x_{1}, \\bfx_{2}) = \\beta_1.\n$$\nThe coefficient on a particular variable is the change in the predicted outcome for a one-unit change in that covariate, holding all other covariates constant. Each coefficient summarizes the \"all else equal\" difference in the predicted outcome for each covariate. \n\n\n### Polynomial functions of the covariates\n\n\n\nThe interpretation of the population regression coefficients becomes more complicated when we include nonlinear functions of the covariates. In that case, multiple coefficients control how a change in a covariate will change the predicted value of $Y_i$. Suppose that we have a quadratic function of $X_{i1}$,\n$$ \nm(x_1, x_1^2, x_{2}) = \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{1}^{2} + \\beta_{3}x_{2},\n$$\nand try to look at a one-unit change in $x_1$,\n$$ \n\\begin{aligned}\n m(x_{1} + 1, (x_{1} + 1)^{2}, x_{2}) & = \\beta_{0} + \\beta_{1}(x_{1} + 1) + \\beta_{2}(x_{1} + 1)^{2}+ \\beta_{3}x_{2} \\\\\n m(x_{1}, x_{1}^{2}, x_{2}) &= \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{1}^{2} + \\beta_{3}x_{2},\n\\end{aligned} \n$$\nso that the difference between these two predictions is $\\beta_1 + \\beta_2(2x_{1} + 1)$. This formula might be an interesting quantity, but we will more commonly use the derivative of $m(\\bfx)$ with respect to $x_1$ as a measure of the marginal effect of $X_{i1}$ on the predicted value of $Y_i$ (holding all other variables constant), where \"marginal\" here means the change in prediction for a very small change in $X_{i1}$.[^effect] In the case of the quadratic covariate, we have\n$$ \n\\frac{\\partial m(x_{1}, x_{1}^{2}, x_{2})}{\\partial x_{1}} = \\beta_{1} + 2\\beta_{2}x_{1},\n$$\nso the marginal effect on prediction varies as a function of $x_1$. From this, we can see that the individual interpretations of the coefficients are less interesting: $\\beta_1$ is the marginal effect when $X_{i1} = 0$ and $2\\beta_2$ describes how a one-unit change in $X_{i1}$ changes the marginal effect. As is hopefully clear, it will often be more straightforward to visualize the nonlinear predictor function (perhaps using the orthogonalization techniques in @sec-fwl). \n\n\n[^effect]: Notice the choice of language here. The marginal effect is on the predicted value of $Y_i$, not on $Y_i$ itself. So these marginal effects are associational, not necessarily causal quantities. 
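
To make the algebra above concrete, here is a minimal sketch (Python, with made-up coefficient values) comparing the one-unit change in the prediction with the derivative-based marginal effect for the quadratic specification.

```python
# Quadratic specification: m(x1, x2) = b0 + b1*x1 + b2*x1**2 + b3*x2
b0, b1, b2, b3 = 2.0, 1.5, -0.25, 0.75  # illustrative values

def m(x1, x2):
    return b0 + b1 * x1 + b2 * x1**2 + b3 * x2

x1, x2 = 3.0, 1.0

# Discrete change in the prediction for a one-unit increase in x1
print(m(x1 + 1, x2) - m(x1, x2), b1 + b2 * (2 * x1 + 1))  # equal

# Derivative-based marginal effect, which varies with x1
marginal = b1 + 2 * b2 * x1
print(marginal)

# A one-unit increase in x1 shifts the marginal effect by 2*b2
print((b1 + 2 * b2 * (x1 + 1)) - marginal, 2 * b2)        # equal
```
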
\n\n### Interactions\n\nAnother common nonlinear function of the covariates is when we include **interaction terms** or covariates that are products of two other covariates,\n$$ \nm(x_{1}, x_{2}, x_{1}x_{2}) = \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{2} + \\beta_{3}x_{1}x_{2}.\n$$\nIn these situations, we can also use the derivative of the BLP to measure the marginal effect of one variable or the other on the predicted value of $Y_i$. In particular, we have\n$$ \n\\begin{aligned}\n \\frac{\\partial m(x_{1}, x_{2}, x_{1}x_{2})}{\\partial x_1} &= \\beta_1 + \\beta_3x_2, \\\\\n \\frac{\\partial m(x_{1}, x_{2}, x_{1}x_{2})}{\\partial x_2} &= \\beta_2 + \\beta_3x_1.\n\\end{aligned}\n$$\nHere, the coefficients are slightly more interpretable:\n\n* $\\beta_1$: the marginal effect of $X_{i1}$ on predicted $Y_i$ when $X_{i2} = 0$.\n* $\\beta_2$: the marginal effect of $X_{i2}$ on predicted $Y_i$ when $X_{i1} = 0$.\n* $\\beta_3$: the change in the marginal effect of $X_{i1}$ due to a one-unit change in $X_{i2}$ **OR** the change in the marginal effect of $X_{i2}$ due to a one-unit change in $X_{i1}$.\n\nIf we add more covariates to this BLP, these interpretations change to \"holding all other covariates constant.\"\n\nInteractions are a routine part of social science research because they allow us to assess how the relationship between the outcome and an independent variable varies by the values of another variable. In the context of our study of voter wait times, if $X_{i1}$ is income and $X_{i2}$ is the Black/non-Black voter indicator, then $\\beta_3$ represents the change in the slope of the wait time-income relationship between Black and non-Black voters. \n\n\n## Multiple regression from bivariate regression {#sec-fwl}\n\nWhen we have a regression of an outcome on two covariates, it is helpful to understand how the coefficients of one variable relate to the other. For example, if we have the following best linear projection:\n$$ \n(\\alpha, \\beta, \\gamma) = \\argmin_{(a,b,c) \\in \\mathbb{R}^{3}} \\; \\E[(Y_{i} - (a + bX_{i} + cZ_{i}))^{2}]\n$$ {#eq-two-var-blp}\nIs there some way to understand the $\\beta$ coefficient here regarding simple linear regression? As it turns out, yes. From the above results, we know that the intercept has a simple form:\n$$\n\\alpha = \\E[Y_i] - \\beta\\E[X_i] - \\gamma\\E[Z_i].\n$$\nLet's investigate the first order condition for $\\beta$:\n$$ \n\\begin{aligned}\n 0 &= \\E[Y_{i}X_{i}] - \\alpha\\E[X_{i}] - \\beta\\E[X_{i}^{2}] - \\gamma\\E[X_{i}Z_{i}] \\\\\n &= \\E[Y_{i}X_{i}] - \\E[Y_{i}]\\E[X_{i}] + \\beta\\E[X_{i}]^{2} + \\gamma\\E[X_{i}]\\E[Z_{i}] - \\beta\\E[X_{i}^{2}] - \\gamma\\E[X_{i}Z_{i}] \\\\\n &= \\cov(Y, X) - \\beta\\V[X_{i}] - \\gamma \\cov(X_{i}, Z_{i})\n\\end{aligned}\n$$\nWe can see from this that if $\\cov(X_{i}, Z_{i}) = 0$, then the coefficient on $X_i$ will be the same as in the simple regression case, $\\cov(Y_{i}, X_{i})/\\V[X_{i}]$. When $X_i$ and $Z_i$ are uncorrelated, we sometimes call them **orthogonal**. 
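
The following minimal simulation (Python; the data and coefficients are invented for illustration) previews the orthogonalization argument developed in the next paragraph: residualizing $X_i$ with respect to $Z_i$ and then regressing $Y_i$ on that residual reproduces the coefficient on $X_i$ from the two-covariate BLP.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Correlated covariates, so simple and multiple regression slopes differ
Z = rng.normal(size=n)
X = 0.6 * Z + rng.normal(size=n)
Y = 1.0 + 2.0 * X - 3.0 * Z + rng.normal(size=n)

def ls(design, y):
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Coefficient on X from the regression of Y on (1, X, Z)
long_beta = ls(np.column_stack([np.ones(n), X, Z]), Y)[1]

# Orthogonalize X with respect to Z, then regress Y on the residual
d0, d1 = ls(np.column_stack([np.ones(n), Z]), X)
X_tilde = X - (d0 + d1 * Z)
fwl_beta = np.cov(Y, X_tilde, bias=True)[0, 1] / np.var(X_tilde)

print(long_beta, fwl_beta)  # both approximately 2.0
```
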
\n\nTo write a simple formula for $\\beta$ when the covariates are not orthogonal, we will **orthogonalize** $X_i$ by obtaining the prediction errors from a population linear regression of $X_i$ on $Z_i$:\n$$ \n\\widetilde{X}_{i} = X_{i} - (\\delta_{0} + \\delta_{1}Z_{i}) \\quad\\text{where}\\quad (\\delta_{0}, \\delta_{1}) = \\argmin_{(d_{0},d_{1}) \\in \\mathbb{R}^{2}} \\; \\E[(X_{i} - (d_{0} + d_{1}Z_{i}))^{2}]\n$$\nGiven the properties of projection errors, we know that this orthogonalized version of $X_{i}$ will be uncorrelated with $Z_{i}$ since $\\E[\\widetilde{X}_{i}Z_{i}] = 0$. Remarkably, the coefficient on $X_i$ from the \"long\" BLP in @eq-two-var-blp is the same as the regression of $Y_i$ on this orthogonalized $\\widetilde{X}_i$, \n$$ \n\\beta = \\frac{\\text{cov}(Y_{i}, \\widetilde{X}_{i})}{\\V[\\widetilde{X}_{i}]}\n$$\n\nWe can expand this idea to when there are several other covariates. Suppose now that we are interested in a regression of $Y_i$ on $\\X_i$ and we are interested in the coefficient on the $k$th covariate. Let $\\X_{i,-k}$ be the vector of covariates omitting the $k$th entry and let $m_k(\\X_{i,-k})$ represent the BLP of $X_{ik}$ on these other covariates. We can define $\\widetilde{X}_{ik} = X_{ik} - m_{k}(\\X_{i,-k})$ as the $k$th variable orthogonalized with respect to the rest of the variables and we can write the coefficient on $X_{ik}$ as\n$$ \n\\beta_k = \\frac{\\cov(Y_i, \\widetilde{X}_{ik})}{\\V[\\widetilde{X}_{ik}]}.\n$$ \nThus, the population regression coefficient in the BLP is the same as from a bivariate regression of the outcome on the projection error for $X_{ik}$ projected on all other covariates. One interpretation of coefficients in a population multiple regression is they represent the relationship between the outcome and the covariate after removing the linear relationships of all other variables. \n\n\n## Omitted variable bias\n\nIn many situations, we might need to choose whether to include a variable in a regression or not, so it can be helpful to understand how this choice might affect the population coefficients on the other variables in the regression. Suppose we have a variable $Z_i$ that we may add to our regression which currently has $\\X_i$ as the covariates. We can write this new projection as \n$$ \nm(\\X_i, Z_i) = \\X_i'\\bfbeta + Z_i\\gamma, \\qquad m(\\X_{i}) = \\X_i'\\bs{\\delta},\n$$\nwhere we often refer to $m(\\X_i, Z_i)$ as the long regression and $m(\\X_i)$ as the short regression. \n\nWe know from the definition of the BLP that we can write the short coefficients as \n$$ \n\\bs{\\delta} = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1} \\E[\\X_{i}Y_{i}].\n$$\nLetting $e_i = Y_i - m(\\X_{i}, Z_{i})$ be the projection errors from the long regression, we can write this as\n$$ \n\\begin{aligned}\n \\bs{\\delta} &= \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1} \\E[\\X_{i}(\\X_{i}'\\bfbeta + Z_{i}\\gamma + e_{i})] \\\\\n &= \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}(\\E[\\X_{i}\\X_{i}']\\bfbeta + \\E[\\X_{i}Z_{i}]\\gamma + \\E[\\X_{i}e_{i}]) \\\\\n &= \\bfbeta + \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Z_{i}]\\gamma\n\\end{aligned}\n$$\nNotice that the vector in the second term is the linear projection coefficients of a population linear regression of $Z_i$ on the $\\X_i$. If we call these coefficients $\\bs{\\pi}$, then the short coefficients are \n$$ \n\\bs{\\delta} = \\bfbeta + \\bs{\\pi}\\gamma. 
\n$$\n\nWe can rewrite this to show that the difference between the coefficients in these two projections is $\\bs{\\delta} - \\bfbeta= \\bs{\\pi}\\gamma$ or the product of the coefficient on the \"excluded\" $Z_i$ and the coefficient of the included $\\X_i$ on the excluded. Most textbooks refer to this difference as the **omitted variable bias** of omitting $Z_i$ under the idea that $\\bfbeta$ is the true target of inference. But the result is much broader than this since it just tells us how to relate the coefficients of two nested projections. \n\n\nThe last two results (multiple regressions from bivariate and omitted variable bias) are sometimes presented as results for the ordinary least squares estimator that we will show in the next chapter. We introduce them here as features of a particular population quantity, the linear projection or population linear regression. \n\n\n## Drawbacks of the BLP\n\nThe best linear predictor is, of course, a *linear* approximation to the CEF, and this approximation could be quite poor if the true CEF is highly nonlinear. A more subtle issue with the BLP is that it is sensitive to the marginal distribution of the covariates when the CEF is nonlinear. Let's return to our example of voter wait times and income. In @fig-blp-limits, we show the true CEF and the BLP when we restrict income below \\$50,000 or above \\$100,000. The BLP can vary quite dramatically here. This figure is an extreme example, but the essential point will still hold as the marginal distribution of $X_i$ changes.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Linear projections for when truncating income distribution below $50k and above $100k.](06_linear_model_files/figure-html/fig-blp-limits-1.png){#fig-blp-limits width=672}\n:::\n:::\n", "supporting": [ "06_linear_model_files/figure-html" ], diff --git a/_freeze/06_linear_model/execute-results/tex.json b/_freeze/06_linear_model/execute-results/tex.json index 2cfb852..20f66d3 100644 --- a/_freeze/06_linear_model/execute-results/tex.json +++ b/_freeze/06_linear_model/execute-results/tex.json @@ -1,7 +1,7 @@ { - "hash": "462f1dcdf8bf485c66f7cba2bc47ca4d", + "hash": "1771ecc1d57037175f1ca8d02824cb54", "result": { - "markdown": "# Linear regression {#sec-regression}\n\n\nRegression is a tool for assessing the relationship between an **outcome variable**, $Y_i$, and a set of **covariates**, $\\X_i$. In particular, these tools show how the conditional mean of $Y_i$ varies as a function of $\\X_i$. For example, we may want to know how voting poll wait times vary as a function of some socioeconomic features of the precinct, like income and racial composition. We usually accomplish this task by estimating the **regression function** or **conditional expectation function** (CEF) of the outcome given the covariates, \n$$\n\\mu(\\bfx) = \\E[Y_i \\mid \\X_i = \\bfx].\n$$\nWhy are estimation and inference for this regression function special? Why can't we just use the approaches we have seen for the mean, variance, covariance, and so on? The fundamental problem with the CEF is that there may be many, many values $\\bfx$ that can occur and many different conditional expectations that we will need to estimate. If any variable in $\\X_i$ is continuous, we must estimate an infinite number of possible values of $\\mu(\\bfx)$. Because it worsens as we add covariates to $\\X_i$, we refer to this problem as the **curse of dimensionality**. 
How can we resolve this with our measly finite data?\n\nIn this chapter, we will explore two ways of \"solving\" the curse of dimensionality: assuming it away and changing the quantity of interest to something easier to estimate. \n\n\nRegression is so ubiquitous in many scientific fields that it has a lot of acquired notational baggage. In particular, the labels of the $Y_i$ and $\\X_i$ vary greatly:\n\n- The outcome can also be called: the response variable, the dependent variable, the labels (in machine learning), the left-hand side variable, or the regressand. \n- The covariates are also called: the explanatory variables, the independent variables, the predictors, the regressors, inputs, or features. \n\n\n## Why do we need models?\n\nAt first glance, the connection between the CEF and parametric models might be hazy. For example, imagine we are interested in estimating the average poll wait times ($Y_i$) for Black voters ($X_i = 1$) versus non-Black voters ($X_i=0$). In that case, there are two parameters to estimate, \n$$\n\\mu(1) = \\E[Y_i \\mid X_i = 1] \\quad \\text{and}\\quad \\mu(0) = \\E[Y_i \\mid X_i = 0],\n$$\nwhich we could estimate by using the plug-in estimators that replace the population averages with their sample counterparts,\n$$ \n\\widehat{\\mu}(1) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 1)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 1)} \\qquad \\widehat{\\mu}(0) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 0)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 0)}.\n$$\nThese are just the sample averages of the wait times for Black and non-Black voters, respectively. And because the race variable here is discrete, we are simply estimating sample means within subpopulations defined by race. The same logic would apply if we had $k$ racial categories: we would have $k$ conditional expectations to estimate and $k$ (conditional) sample means. \n\nNow imagine that we want to know how the average poll wait time varies as a function of income so that $X_i$ is (essentially) continuous. Now we have a different conditional expectation for every possible dollar amount from 0 to Bill Gates's income. Imagine we pick a particular income, \\$42,238, and so we are interested in the conditional expectation $\\mu(42,238)= \\E[Y_{i}\\mid X_{i} = 42,238]$. We could use the same plug-in estimator in the discrete case, \n$$\n\\widehat{\\mu}(42,238) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 42,238)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 42,238)}.\n$$\nWhat is the problem with this estimator? In all likelihood, no units in any particular dataset have that exact income, meaning this estimator is undefined (we would be dividing by zero). \n\n\nOne solution to this problem is to use **subclassification**, turn the continuous variable into a discrete one, and proceed with the discrete approach above. We might group incomes into \\$25,000 bins and then calculate the average wait times of anyone between, say, \\$25,000 and \\$50,000 income. When we make this estimator switch for practical purposes, we need to connect it back to the DGP of interest. 
We could **assume** that the CEF of interest only depends on these binned means, which would mean we have: \n$$\n\\mu(x) = \n\\begin{cases}\n \\E[Y_{i} \\mid 0 \\leq X_{i} < 25,000] &\\text{if } 0 \\leq x < 25,000 \\\\\n \\E[Y_{i} \\mid 25,000 \\leq X_{i} < 50,000] &\\text{if } 25,000 \\leq x < 50,000\\\\\n \\E[Y_{i} \\mid 50,000 \\leq X_{i} < 100,000] &\\text{if } 50,000 \\leq x < 100,000\\\\\n \\vdots \\\\\n \\E[Y_{i} \\mid 200,000 \\leq X_{i}] &\\text{if } 200,000 \\leq x\\\\\n\\end{cases}\n$$\nThis approach assumes, perhaps incorrectly, that the average wait time does not vary within the bins. @fig-cef-binned shows a hypothetical joint distribution between income and wait times with the true CEF, $\\mu(x)$, shown in red. The figure also shows the bins created by subclassification and the implied CEF if we assume bin-constant means in blue. We can see that the blue function approximates the true CEF but deviates from it close to the bin edges. The trade-off is that once we make the assumption, we only have to estimate one mean for every bin rather than an infinite number of means for each possible income. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of income and poll wait times (contour plot), conditional expectation function (red), and the conditional expectation of the binned income (blue).](06_linear_model_files/figure-pdf/fig-cef-binned-1.pdf){#fig-cef-binned}\n:::\n:::\n\n\n\nSimilarly, we could **assume** that the CEF follows a simple functional form like a line,\n$$ \n\\mu(x) = \\E[Y_{i}\\mid X_{i} = x] = \\beta_{0} + \\beta_{1} x.\n$$\nThis assumption reduces our infinite number of unknowns (the conditional mean at every possible income) to just two unknowns: the slope and intercept. As we will see, we can use the standard ordinary least squares to estimate these parameters. Notice again that if the true CEF is nonlinear, this assumption is incorrect, and any estimate based on this assumption might be biased or even inconsistent. \n\nWe call the binning and linear assumptions on $\\mu(x)$ **functional form** assumptions because they restrict the class of functions that $\\mu(x)$ can take. While powerful, these types of assumptions can muddy the roles of defining the quantity of interest and estimation. If our estimator $\\widehat{\\mu}(x)$ performs poorly, it will be difficult to tell if this is because the estimator is flawed or our functional form assumptions are incorrect. \n\nTo help clarify these issues, we will pursue a different approach: understanding what linear regression can estimate under minimal assumptions and then investigating how well this estimand approximates the true CEF. \n\n## Population linear regression {#sec-linear-projection}\n\n### Bivariate linear regression \n\n\nLet's set aside the idea of the conditional expectation function and instead focus on finding the **linear** function of a single covariate $X_i$ that best predicts the outcome. Remember that linear functions have the form $a + bX_i$. The **best linear predictor** (BLP) or **population linear regression** of $Y_i$ on $X_i$ is defined as\n$$ \nm(x) = \\beta_0 + \\beta_1 x \\quad\\text{where, }\\quad (\\beta_{0}, \\beta_{1}) = \\argmin_{(b_{0}, b_{1}) \\in \\mathbb{R}^{2}}\\; \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}].\n$$\nThat is, the best linear predictor is the line that results in the lowest mean-squared error predictions of the outcome given the covariates, averaging over the joint distribution of the data. 
This function is a feature of the joint distribution of the data---the DGP---and so represents something that we would like to learn about with our sample. It is an alternative to the CEF for summarizing the relationship between the outcome and the covariate, though we will see that they will sometimes be equal. We call $(\\beta_{0}, \\beta_{1})$ the **population linear regression coefficients**. Notice that $m(x)$ could differ greatly from the CEF $\\mu(x)$ if the latter is nonlinear. \n\nWe can solve for the best linear predictor using standard calculus (taking the derivative with respect to each coefficient, setting those equations equal to 0, and solving the system of equations). The first-order conditions, in this case, are\n$$ \n\\begin{aligned}\n \\frac{\\partial \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}]}{\\partial b_{0}} = \\E[-2(Y_{i} - \\beta_{0} - \\beta_{1}X_{i})] = 0 \\\\\n \\frac{\\partial \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}]}{\\partial b_{1}} = \\E[-2(Y_{i} - \\beta_{0} - \\beta_{1}X_{i})X_{i}] = 0\n\\end{aligned} \n$$\nGiven the linearity of expectations, it is easy to solve for $\\beta_0$ in terms of $\\beta_1$,\n$$ \n\\beta_{0} = \\E[Y_{i}] - \\beta_{1}\\E[X_{i}].\n$$\nWe can plug this into the first-order condition for $\\beta_1$ to get\n$$ \n\\begin{aligned}\n 0 &= \\E[Y_{i}X_{i}] - (\\E[Y_{i}] - \\beta_{1}\\E[X_{i}])\\E[X_{i}] - \\beta_{1}\\E[X_{i}^{2}] \\\\\n &= \\E[Y_{i}X_{i}] - \\E[Y_{i}]\\E[X_{i}] - \\beta_{1}(\\E[X_{i}^{2}] - \\E[X_{i}]^{2}) \\\\\n &= \\cov(X_{i},Y_{i}) - \\beta_{1}\\V[X_{i}]\\\\\n \\beta_{1} &= \\frac{\\cov(X_{i},Y_{i})}{\\V[X_{i}]}\n\\end{aligned}\n$$\n\nThus the slope on the population linear regression of $Y_i$ on $X_i$ is equal to the ratio of the covariance of the two variables divided by the variance of $X_i$. From this, we can immediately see that the covariance will determine the sign of the slope: positive covariances will lead to positive $\\beta_1$ and negative covariances will lead to negative $\\beta_1$. In addition, we can see that if $Y_i$ and $X_i$ are independent, $\\beta_1 = 0$. The slope scales this covariance by the variance of the covariate, so slopes are lower for more spread-out covariates and higher for more spread-out covariates. If we define the correlation between these variables as $\\rho_{YX}$, then we can relate the coefficient to this quantity as \n$$\n\\beta_1 = \\rho_{YX}\\sqrt{\\frac{\\V[Y_i]}{\\V[X_i]}}.\n$$\n\nCollecting together our results, we can write the population linear regression as \n$$\nm(x) = \\beta_0 + \\beta_1x = \\E[Y_i] + \\beta_1(x - \\E[X_i]),\n$$\nwhich shows how we adjust our best guess about $Y_i$ from the mean of the outcome using the covariate. \n\nIt's important to remember that the BLP, $m(x)$, and the CEF, $\\mu(x)$, are distinct entities. If the CEF is nonlinear, as in @fig-cef-blp, there will be a difference between these functions, meaning that the BLP might produce subpar predictions. Below, we will derive a formal connection between the BLP and the CEF. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Comparison of the CEF and the best linear predictor.](06_linear_model_files/figure-pdf/fig-cef-blp-1.pdf){#fig-cef-blp}\n:::\n:::\n\n\n\n\n### Beyond linear approximations\n\nThe linear part of the best linear predictor is less restrictive than at first glance. We can easily modify the minimum MSE problem to find the best quadratic, cubic, or general polynomial function of $X_i$ that predicts $Y_i$. 
For example, the quadratic function of $X_i$ that best predicts $Y_i$ would be\n$$ \nm(X_i, X_i^2) = \\beta_0 + \\beta_1X_i + \\beta_2X_i^2 \\quad\\text{where}\\quad \\argmin_{(b_0,b_1,b_2) \\in \\mathbb{R}^3}\\;\\E[(Y_{i} - b_{0} - b_{1}X_{i} - b_{2}X_{i}^{2})^{2}].\n$$\nThis equation is now a quadratic function of the covariates, but it is still a linear function of the unknown parameters $(\\beta_{0}, \\beta_{1}, \\beta_{2})$ so we will call this a best linear predictor. \n\nWe could include higher order terms of $X_i$ in the same manner, and as we include more polynomial terms, $X_i^p$, the more flexible the function of $X_i$ we will capture with the BLP. When we estimate the BLP, however, we usually will pay for this flexibility in terms of overfitting and high variance in our estimates. \n\n\n### Linear prediction with multiple covariates \n\nWe now generalize the idea of a best linear predictor to a setting with an arbitrary number of covariates. In this setting, remember that the linear function will be\n\n$$ \n\\bfx'\\bfbeta = x_{1}\\beta_{1} + x_{2}\\beta_{2} + \\cdots + x_{k}\\beta_{k}.\n$$\nWe will define the **best linear predictor** (BLP) to be\n$$ \nm(\\bfx) = \\bfx'\\bfbeta, \\quad \\text{where}\\quad \\bfbeta = \\argmin_{\\mb{b} \\in \\real^k}\\; \\E\\bigl[ \\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\\bigr]\n$$\n\nThis BLP solves the same fundamental optimization problem as in the bivariate case: it chooses the set of coefficients that minimizes the mean-squared error averaging over the joint distribution of the data. \n\n\n\n::: {.callout-note}\n## Best linear projection assumptions\n\nWithout some assumptions on the joint distribution of the data, the following \"regularity conditions\" will ensure the existence of the BLP:\n\n1. $\\E[Y^2] < \\infty$ (outcome has finite mean/variance)\n2. $\\E\\Vert \\mb{X} \\Vert^2 < \\infty$ ($\\mb{X}$ has finite means/variances/covariances)\n3. $\\mb{Q}_{\\mb{XX}} = \\E[\\mb{XX}']$ is positive definite (columns of $\\X$ are linearly independent) \n:::\n\nUnder these assumptions, it is possible to derive a closed-form expression for the **population coefficients** $\\bfbeta$ using matrix calculus. To set up the optimization problem, we will find the first-order condition by taking the derivative of the expectation of the squared errors. First, let's take the derivative of the squared prediction errors using the chain rule:\n$$ \n\\begin{aligned}\n \\frac{\\partial}{\\partial \\mb{b}}\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)^{2}\n &= 2\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)\\frac{\\partial}{\\partial \\mb{b}}(Y_{i} - \\X_{i}'\\mb{b}) \\\\\n &= -2\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)\\X_{i} \\\\\n &= -2\\X_{i}\\left(Y_{i} - \\X_{i}'\\mb{b}\\right) \\\\\n &= -2\\left(\\X_{i}Y_{i} - \\X_{i}\\X_{i}'\\mb{b}\\right),\n\\end{aligned}\n$$\nwhere the third equality comes from the fact that $(Y_{i} - \\X_{i}'\\bfbeta)$ is a scalar. We can now plug this into the expectation to get the first-order condition and solve for $\\bfbeta$,\n$$ \n\\begin{aligned}\n 0 &= -2\\E[\\X_{i}Y_{i} - \\X_{i}\\X_{i}'\\bfbeta ] \\\\\n \\E[\\X_{i}\\X_{i}'] \\bfbeta &= \\E[\\X_{i}Y_{i}],\n\\end{aligned}\n$$\nwhich implies the population coefficients are\n$$ \n\\bfbeta = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] = \\mb{Q}_{\\mb{XX}}^{-1}\\mb{Q}_{\\mb{X}Y}\n$$\nWe now have an expression for the coefficients for the population best linear predictor in terms of the joint distribution $(Y_{i}, \\X_{i})$. 
A couple of facts might be useful for reasoning this expression. Recall that $\\mb{Q}_{\\mb{XX}} = \\E[\\X_{i}\\X_{i}']$ is a $k\\times k$ matrix and $\\mb{Q}_{\\X Y} = \\E[\\X_{i}Y_{i}]$ is a $k\\times 1$ column vector, which implies that $\\bfbeta$ is also a $k \\times 1$ column vector. \n\n::: {.callout-note}\n\nIntuitively, what is happening in the expression for the population regression coefficients? It is helpful to separate the intercept or constant term so that we have\n$$ \nY_{i} = \\beta_{0} + \\X'\\bfbeta + e_{i},\n$$\nso $\\bfbeta$ refers to just the vector of coefficients for the covariates. In this case, we can write the coefficients in a more interpretable way:\n$$ \n\\bfbeta = \\V[\\X]^{-1}\\text{Cov}(\\X, Y), \\qquad \\beta_0 = \\mu_Y - \\mb{\\mu}'_{\\mb{X}}\\bfbeta\n$$\n\nThus, the population coefficients take the covariance between the outcome and the covariates and \"divide\" it by information about variances and covariances of the covariates. The intercept recenters the regression so that projection errors are mean zero. Thus, we can see that these coefficients generalize the bivariate formula to this multiple covariate context. \n:::\n\nWith an expression for the population linear regression coefficients, we can write the linear projection as \n$$ \nm(\\X_{i}) = \\X_{i}'\\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] = \\X_{i}'\\mb{Q}_{\\mb{XX}}^{-1}\\mb{Q}_{\\mb{X}Y}\n$$\n\n\n\n### Projection error\n\nThe **projection error** is the difference between the actual value of $Y_i$ and the projection,\n$$ \ne_{i} = Y_{i} - m(\\X_{i}) = Y_i - \\X_{i}'\\bfbeta,\n$$\nwhere we have made no assumptions about this error yet. The projection error is simply the prediction error of the best linear prediction. Rewriting this definition, we can see that we can always write the outcome as the linear projection plus the projection error,\n$$ \nY_{i} = \\X_{i}'\\bfbeta + e_{i}.\n$$\nNotice that this looks suspiciously similar to a linearity assumption on the CEF, but we haven't made any assumptions here. Instead, we have just used the definition of the projection error to write a tautological statement: \n$$ \nY_{i} = \\X_{i}'\\bfbeta + e_{i} = \\X_{i}'\\bfbeta + Y_{i} - \\X_{i}'\\bfbeta = Y_{i}.\n$$\nThe critical difference between this representation and the usual linear model assumption is what properties $e_{i}$ possesses. \n\nOne key property of the projection errors is that when the covariate vector includes an \"intercept\" or constant term, the projection errors are uncorrelated with the covariates. To see this, we first note that $\\E[\\X_{i}e_{i}] = 0$ since\n$$ \n\\begin{aligned}\n \\E[\\X_{i}e_{i}] &= \\E[\\X_{{i}}(Y_{i} - \\X_{i}'\\bfbeta)] \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}\\X_{i}']\\bfbeta \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}\\X_{i}']\\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}Y_{i}] = 0\n\\end{aligned}\n$$\nThus, for every $X_{ij}$ in $\\X_{i}$, we have $\\E[X_{ij}e_{i}] = 0$. If one of the entries in $\\X_i$ is a constant 1, then this also implies that $\\E[e_{i}] = 0$. Together, these facts imply that the projection error is uncorrelated with each $X_{ij}$, since\n$$ \n\\cov(X_{ij}, e_{i}) = \\E[X_{ij}e_{i}] - \\E[X_{ij}]\\E[e_{i}] = 0 - 0 = 0\n$$\nNotice that we still have made no assumptions about these projection errors except for some mild regularity conditions on the joint distribution of the outcome and covariates. 
Thus, in very general settings, we can write the linear projection model $Y_i = \\X_i'\\bfbeta + e_i$ where $\\bfbeta = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}]$ and conclude that $\\E[\\X_{i}e_{i}] = 0$ by definition, not by assumption. \n\nThe projection error is uncorrelated with the covariates, so does this mean that the CEF is linear? Unfortunately, no. Recall that while independence implies uncorrelated, the reverse does not hold. So when we look at the CEF, we have\n$$ \n\\E[Y_{i} \\mid \\X_{i}] = \\X_{i}'\\bfbeta + \\E[e_{i} \\mid \\X_{i}],\n$$\nand the last term $\\E[e_{i} \\mid \\X_{i}]$ would only be 0 if the errors were independent of the covariates, so $\\E[e_{i} \\mid \\X_{i}] = \\E[e_{i}] = 0$. But nowhere in the linear projection model did we assume this. So while we can (almost) always write the outcome as $Y_i = \\X_i'\\bfbeta + e_i$ and have those projection errors be uncorrelated with the covariates, it will require additional assumptions to ensure that the true CEF is, in fact, linear $\\E[Y_{i} \\mid \\X_{i}] = \\X_{i}'\\bfbeta$. \n\nLet's take a step back. What have we shown here? In a nutshell, we have shown that a population linear regression exists under very general conditions, and we can write the coefficients of that population linear regression as a function of expectations of the joint distribution of the data. We did not assume that the CEF was linear nor that the projection errors were normal. \n\n\nWhy do we care about this? The ordinary least squares estimator, the workhorse regression estimator, targets this quantity of interest in large samples, regardless of whether the true CEF is linear or not. Thus, even when a linear CEF assumption is incorrect, OLS still targets a perfectly valid quantity of interest: the coefficients from this population linear projection. \n\n## Linear CEFs without assumptions\n\nWhat is the relationship between the best linear predictor (which we just saw generally exists) and the CEF? To draw the connection, remember a vital property of the conditional expectation: it is the function of $\\X_i$ that best predicts $Y_{i}$. The population regression was the best **linear** predictor, but the CEF is the best predictor among all nicely behaved functions of $\\X_{i}$, linear or nonlinear. In particular, if we label $L_2$ to be the set of all functions of the covariates $g()$ that have finite squared expectation, $\\E[g(\\X_{i})^{2}] < \\infty$, then we can show that the CEF has the lowest squared prediction error in this class of functions:\n$$ \n\\mu(\\X) = \\E[Y_{i} \\mid \\X_{i}] = \\argmin_{g(\\X_i) \\in L_2}\\; \\E\\left[(Y_{i} - g(\\X_{i}))^{2}\\right],\n$$\n\nSo we have established that the CEF is the best predictor and the population linear regression $m(\\X_{i})$ is the best linear predictor. These two facts allow us to connect the CEF and the population regression.\n\n::: {#thm-cef-blp}\nIf $\\mu(\\X_{i})$ is a linear function of $\\X_i$, then $\\mu(\\X_{i}) = m(\\X_{i}) = \\X_i'\\bfbeta$. \n\n:::\n\nThis theorem says that if the true CEF is linear, it equals the population linear regression. The proof of this is straightforward: the CEF is the best predictor, so if it is linear, it must also be the best linear predictor. \n \n \nIn general, we are in the business of learning about the CEF, so we are unlikely to know if it genuinely is linear or not. In some situations, however, we can show that the CEF is linear without any additional assumptions. 
These will be situations when the covariates take on a finite number of possible values. Suppose we are interested in the CEF of poll wait times for Black ($X_i = 1$) vs. non-Black ($X_i = 0$) voters. In this case, there are two possible values of the CEF, $\\mu(1) = \\E[Y_{i}\\mid X_{i}= 1]$, the average wait time for Black voters, and $\\mu(0) = \\E[Y_{i}\\mid X_{i} = 0]$, the average wait time for non-Black voters. Notice that we can write the CEF as\n$$ \n\\mu(x) = x \\mu(1) + (1 - x) \\mu(0) = \\mu(0) + x\\left(\\mu(1) - \\mu(0)\\right)= \\beta_0 + x\\beta_1,\n$$\nwhich is clearly a linear function of $x$. Based on this derivation, we can see that the coefficients of this linear CEF have a clear interpretation:\n\n- $\\beta_0 = \\mu(0)$: the expected wait time for a Black voter. \n- $\\beta_1 = \\mu(1) - \\mu(0)$: the difference in average wait times between Black and non-Black voters. \nNotice that it matters how $X_{i}$ is defined here since the intercept will always be the average outcome when $X_i = 0$, and the slope will always be the difference in means between the $X_i = 1$ group and the $X_i = 0$ group. \n\nWhat about a categorical covariate with more than two levels? For instance, we might be interested in wait times by party identification, where $X_i = 1$ indicates Democratic voters, $X_i = 2$ indicates Republican voters, and $X_i = 3$ indicates independent voters. How can we write the CEF of wait times as a linear function of this variable? That would assume that the difference between Democrats and Republicans is the same as for Independents and Republicans. With more than two levels, we can represent a categorical variable as a vector of binary variables, $\\X_i = (X_{i1}, X_{i2})$, where\n$$ \n\\begin{aligned}\n X_{{i1}} &= \\begin{cases}\n 1&\\text{if Republican} \\\\\n 0 & \\text{if not Republican}\n \\end{cases} \\\\\nX_{{i2}} &= \\begin{cases}\n 1&\\text{if independent} \\\\\n 0 & \\text{if not independent}\n \\end{cases} \\\\\n\\end{aligned}\n$$\nThese two indicator variables encode the same information as the original three-level variable, $X_{i}$. If I know the values of $X_{i1}$ and $X_{i2}$, I know exactly what party to which $i$ belongs. Thus, the CEFs for $X_i$ and the pair of indicator variables, $\\X_i$, are precisely the same, but the latter admits a lovely linear representation,\n$$\n\\E[Y_i \\mid X_{i1}, X_{i2}] = \\beta_0 + \\beta_1 X_{i1} + \\beta_2 X_{i2},\n$$\nwhere\n\n- $\\beta_0 = \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the average wait time for the group who does not get an indicator variable (Democrats in this case). \n- $\\beta_1 = \\E[Y_{i} \\mid X_{i1} = 1, X_{i2} = 0] - \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the difference in means between Republican voters and Democratic voters, or the difference between the first indicator group and the baseline group. \n- $\\beta_2 = \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 1] - \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the difference in means between independent voters and Democratic voters, or the difference between the second indicator group and the baseline group.\n\nThis approach easily generalizes to categorical variables with an arbitrary number of levels. \n\nWhat have we shown? The CEF will be linear without additional assumptions when there is a categorical covariate. We can show that this continues to hold even when we have multiple categorical variables. We now have two binary covariates: $X_{i1}=1$ indicating a Black voter, and $X_{i2} = 1$ indicating an urban voter. 
With these two binary variables, there are four possible values of the CEF:\n$$ \n\\mu(x_1, x_2) = \\begin{cases} \n \\mu_{00} & \\text{if } x_1 = 0 \\text{ and } x_2 = 0 \\text{ (non-Black, rural)} \\\\\n \\mu_{10} & \\text{if } x_1 = 1 \\text{ and } x_2 = 0 \\text{ (Black, rural)} \\\\\n \\mu_{01} & \\text{if } x_1 = 0 \\text{ and } x_2 = 1 \\text{ (non-Black, urban)} \\\\\n \\mu_{11} & \\text{if } x_1 = 1 \\text{ and } x_2 = 1 \\text{ (Black, urban)}\n \\end{cases}\n$$\nWe can write this as\n$$ \n\\mu(x_{1}, x_{2}) = (1 - x_{1})(1 - x_{2})\\mu_{00} + x_{1}(1 -x_{2})\\mu_{10} + (1-x_{1})x_{2}\\mu_{01} + x_{1}x_{2}\\mu_{11},\n$$\nwhich we can rewrite as \n$$ \n\\mu(x_1, x_2) = \\beta_0 + x_1\\beta_1 + x_2\\beta_2 + x_1x_2\\beta_3,\n$$\nwhere\n\n- $\\beta_0 = \\mu_{00}$: average wait times for rural non-Black voters. \n- $\\beta_1 = \\mu_{10} - \\mu_{00}$: difference in means for rural Black vs. rural non-Black voters. \n- $\\beta_2 = \\mu_{01} - \\mu_{00}$: difference in means for urban non-Black vs. rural non-Black voters. \n- $\\beta_3 = (\\mu_{11} - \\mu_{01}) - (\\mu_{10} - \\mu_{00})$: difference in urban racial difference vs rural racial difference.\n\nThus, we can write the CEF with two binary covariates as linear when the linear specification includes a multiplicative interaction between them ($x_1x_2$). This result holds for all pairs of binary covariates, and we can generalize the interpretation of the coefficients in the CEF as\n\n- $\\beta_0 = \\mu_{00}$: average outcome when both variables are 0. \n- $\\beta_1 = \\mu_{10} - \\mu_{00}$: difference in average outcomes for the first covariate when the second covariate is 0. \n- $\\beta_2 = \\mu_{01} - \\mu_{00}$: difference in average outcomes for the second covariate when the first covariate is 0. \n- $\\beta_3 = (\\mu_{11} - \\mu_{01}) - (\\mu_{10} - \\mu_{00})$: change in the \"effect\" of the first (second) covariate when the second (first) covariate goes from 0 to 1. \n\nThis result also generalizes to an arbitrary number of binary covariates. If we have $p$ binary covariates, then the CEF will be linear with all two-way interactions, $x_1x_2$, all three-way interactions, $x_1x_2x_3$, up to the $p$-way interaction $x_1\\times\\cdots\\times x_p$. Furthermore, we can generalize to arbitrary numbers of categorical variables by expanding each into a series of binary variables and then including all interactions between the resulting binary variables. \n\n\nWe have established that when we have a set of categorical covariates, the true CEF will be linear, and we have seen the various ways to represent that CEF. Notice that when we use, for example, ordinary least squares, we are free to choose how to include our variables. That means that we could run a regression of $Y_i$ on $X_{i1}$ and $X_{i2}$ without an interaction term. This model will only be correct if $\\beta_3$ is equal to 0, and so the interaction term is irrelevant. Because of this ability to choose our models, it's helpful to have a language for models that capture the linear CEF appropriately. We call a model **saturated** if there are as many coefficients as the CEF's unique values. A saturated model, by its nature, can always be written as a linear function without assumptions. The above examples show how to construct saturated models in various situations.\n\n## Interpretation of the regression coefficients\n\nWe have seen how to interpret population regression coefficients when the CEF is linear without assumptions. 
How do we interpret the population coefficients $\\bfbeta$ in other settings? \n\n\nLet's start with the simplest case, where every entry in $\\X_{i}$ represents a different covariate and no covariate is any function of another (we'll see why this caveat is necessary below). In this simple case, the $k$th coefficient, $\\beta_{k}$, will represent the change in the predicted outcome for a one-unit change in the $k$th covariate $X_{ik}$, holding all other covariates fixed. We can see this from \n$$ \n\\begin{aligned}\n m(x_{1} + 1, x_{2}) & = \\beta_{0} + \\beta_{1}(x_{1} + 1) + \\beta_{2}x_{2} \\\\\n m(x_{1}, x_{2}) &= \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{2},\n\\end{aligned} \n$$\nso that the change in the predicted outcome for increasing $X_{i1}$ by one unit is\n$$\n m(x_{1} + 1, x_{2}) - m(x_{1}, x_{2}) = \\beta_1\n$$\nNotice that nothing changes in this interpretation if we add more covariates to the vector,\n$$\n m(x_{1} + 1, \\bfx_{2}) - m(x_{1}, \\bfx_{2}) = \\beta_1,\n$$\nthe coefficient on a particular variable is the change in the predicted outcome for a one-unit change in the covariate holding all other covariates constant. Each coefficient summarizes the \"all else equal\" difference in the predicted outcome for each covariate. \n\n\n### Polynomial functions of the covariates\n\n\n\nThe interpretation of the population regression coefficients becomes more complicated when we include nonlinear functions of the covariates. In that case, multiple coefficients control how a change in a covariate will change the predicted value of $Y_i$. Suppose that we have a quadratic function of $X_{i1}$,\n$$ \nm(x_1, x_1^2, x_{2}) = \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{1}^{2} + \\beta_{3}x_{2},\n$$\nand try to look at a one-unit change in $x_1$,\n$$ \n\\begin{aligned}\n m(x_{1} + 1, (x_{1} + 1)^{2}, x_{2}) & = \\beta_{0} + \\beta_{1}(x_{1} + 1) + \\beta_{2}(x_{1} + 1)^{2}+ \\beta_{3}x_{2} \\\\\n m(x_{1}, x_{1}^{2}, x_{2}) &= \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{1}^{2} + \\beta_{3}x_{2},\n\\end{aligned} \n$$\nresulting in $\\beta_1 + \\beta_2(2x_{1} + 1)$. This formula might be an interesting quantity, but we will more commonly use the derivative of $m(\\bfx)$ with respect to $x_1$ as a measure of the marginal effect of $X_{i1}$ on the predicted value of $Y_i$ (holding all other variables constant), where \"marginal\" here means the change in prediction for a very small change in $X_{i1}$.[^effect] In the case of the quadratic covariate, we have\n$$ \n\\frac{\\partial m(x_{1}, x_{1}^{2}, x_{2})}{\\partial x_{1}} = \\beta_{1} + 2\\beta_{2}x_{1},\n$$\nso the marginal effect on prediction varies as a function of $x_1$. From this, we can see that the individual interpretations of the coefficients are less interesting: $\\beta_1$ is the marginal effect when $X_{i1} = 0$ and $\\beta_2 / 2$ describes how a one-unit change in $X_{i1}$ changes the marginal effect. As is hopefully clear, it will often be more straightforward to visualize the nonlinear predictor function (perhaps using the orthogonalization techniques in @sec-fwl). \n\n\n[^effect]: Notice the choice of language here. The marginal effect is on the predicted value of $Y_i$, not on $Y_i$ itself. So these marginal effects are associational, not necessarily causal quantities. 
\n\n### Interactions\n\nAnother common nonlinear function of the covariates is when we include **interaction terms** or covariates that are products of two other covariates,\n$$ \nm(x_{1}, x_{2}, x_{1}x_{2}) = \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{2} + \\beta_{3}x_{1}x_{2}.\n$$\nIn these situations, we can also use the derivative of the BLP to measure the marginal effect of one variable or the other on the predicted value of $Y_i$. In particular, we have\n$$ \n\\begin{aligned}\n \\frac{\\partial m(x_{1}, x_{2}, x_{1}x_{2})}{\\partial x_1} &= \\beta_1 + \\beta_3x_2, \\\\\n \\frac{\\partial m(x_{1}, x_{2}, x_{1}x_{2})}{\\partial x_2} &= \\beta_2 + \\beta_3x_1.\n\\end{aligned}\n$$\nHere, the coefficients are slightly more interpretable:\n\n* $\\beta_1$: the marginal effect of $X_{i1}$ on predicted $Y_i$ when $X_{i2} = 0$.\n* $\\beta_2$: the marginal effect of $X_{i2}$ on predicted $Y_i$ when $X_{i1} = 0$.\n* $\\beta_3$: the change in the marginal effect of $X_{i1}$ due to a one-unit change in $X_{i2}$ **OR** the change in the marginal effect of $X_{i2}$ due to a one-unit change in $X_{i2}$.\n\nIf we add more covariates to this BLP, these interpretations change to \"holding all other covariates constant.\"\n\nInteractions are a routine part of social science research because they allow us to assess how the relationship between the outcome and an independent variable varies by the values of another variable. In the context of our study of voter wait times, if $X_{i1}$ is income and $X_{i2}$ is the Black/non-Black voter indicator, then $\\beta_3$ represents the change in the slope of the wait time-income relationship between Black and non-Black voters. \n\n\n## Multiple regression from bivariate regression {#sec-fwl}\n\nWhen we have a regression of an outcome on two covariates, it is helpful to understand how the coefficients of one variable relate to the other. For example, if we have the following best linear projection:\n$$ \n(\\alpha, \\beta, \\gamma) = \\argmin_{(a,b,c) \\in \\mathbb{R}^{3}} \\; \\E[(Y_{i} - (a + bX_{i} + cZ_{i}))^{2}]\n$$ {#eq-two-var-blp}\nIs there some way to understand the $\\beta$ coefficient here regarding simple linear regression? As it turns out, yes. From the above results, we know that the intercept has a simple form:\n$$\n\\alpha = \\E[Y_i] - \\beta\\E[X_i] - \\gamma\\E[Z_i].\n$$\nLet's investigate the first order condition for $\\beta$:\n$$ \n\\begin{aligned}\n 0 &= \\E[Y_{i}X_{i}] - \\alpha\\E[X_{i}] - \\beta\\E[X_{i}^{2}] - \\gamma\\E[X_{i}Z_{i}] \\\\\n &= \\E[Y_{i}X_{i}] - \\E[Y_{i}]\\E[X_{i}] + \\beta\\E[X_{i}]^{2} + \\gamma\\E[X_{i}]\\E[Z_{i}] - \\beta\\E[X_{i}^{2}] - \\gamma\\E[X_{i}Z_{i}] \\\\\n &= \\cov(Y, X) - \\beta\\V[X_{i}] - \\gamma \\cov(X_{i}, Z_{i})\n\\end{aligned}\n$$\nWe can see from this that if $\\cov(X_{i}, Z_{i}) = 0$, then the coefficient on $X_i$ will be the same as in the simple regression case, $\\cov(Y_{i}, X_{i})/\\V[X_{i}]$. When $X_i$ and $Z_i$ are uncorrelated, we sometimes call them **orthogonal**. 
\n\nTo write a simple formula for $\\beta$ when the covariates are not orthogonal, we will **orthogonalize** $X_i$ by obtaining the prediction errors from a population linear regression of $X_i$ on $Z_i$:\n$$ \n\\widetilde{X}_{i} = X_{i} - (\\delta_{0} + \\delta_{1}Z_{i}) \\quad\\text{where}\\quad (\\delta_{0}, \\delta_{1}) = \\argmin_{(d_{0},d_{1}) \\in \\mathbb{R}^{2}} \\; \\E[(X_{i} - (d_{0} + d_{1}Z_{i}))^{2}]\n$$\nGiven the properties of projection errors, we know that this orthogonalized version of $X_{i}$ will be uncorrelated with $Z_{i}$ since $\\E[\\widetilde{X}_{i}Z_{i}] = 0$. Remarkably, the coefficient on $X_i$ from the \"long\" BLP in @eq-two-var-blp is the same as the regression of $Y_i$ on this orthogonalized $\\widetilde{X}_i$, \n$$ \n\\beta = \\frac{\\text{cov}(Y_{i}, \\widetilde{X}_{i})}{\\V[\\widetilde{X}_{i}]}\n$$\n\nWe can expand this idea to when there are several other covariates. Suppose now that we are interested in a regression of $Y_i$ on $\\X_i$ and we are interested in the coefficient on the $k$th covariate. Let $\\X_{i,-k}$ be the vector of covariates omitting the $k$th entry and let $m_k(\\X_{i,-k})$ represent the BLP of $X_{ik}$ on these other covariates. We can define $\\widetilde{X}_{ik} = X_{ik} - m_{k}(\\X_{i,-k})$ as the $k$th variable orthogonalized with respect to the rest of the variables and we can write the coefficient on $X_{ik}$ as\n$$ \n\\beta_k = \\frac{\\cov(Y_i, \\widetilde{X}_{ik})}{\\V[\\widetilde{X}_{ik}]}.\n$$ \nThus, the population regression coefficient in the BLP is the same as from a bivariate regression of the outcome on the projection error for $X_{ik}$ projected on all other covariates. One interpretation of coefficients in a population multiple regression is they represent the relationship between the outcome and the covariate after removing the linear relationships of all other variables. \n\n\n## Omitted variable bias\n\nIn many situations, we might need to choose whether to include a variable in a regression or not, so it can be helpful to understand how this choice might affect the population coefficients on the other variables in the regression. Suppose we have a variable $Z_i$ that we may add to our regression which currently has $\\X_i$ as the covariates. We can write this new projection as \n$$ \nm(\\X_i, Z_i) = \\X_i'\\bfbeta + Z_i\\gamma, \\qquad m(\\X_{i}) = \\X_i'\\bs{\\delta},\n$$\nwhere we often refer to $m(\\X_i, Z_i)$ as the long regression and $m(\\X_i)$ as the short regression. \n\nWe know from the definition of the BLP that we can write the short coefficients as \n$$ \n\\bs{\\delta} = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1} \\E[\\X_{i}Y_{i}].\n$$\nLetting $e_i = Y_i - m(\\X_{i}, Z_{i})$ be the projection errors from the long regression, we can write this as\n$$ \n\\begin{aligned}\n \\bs{\\delta} &= \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1} \\E[\\X_{i}(\\X_{i}'\\bfbeta + Z_{i}\\gamma + e_{i})] \\\\\n &= \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}(\\E[\\X_{i}\\X_{i}']\\bfbeta + \\E[\\X_{i}Z_{i}]\\gamma + \\E[\\X_{i}e_{i}]) \\\\\n &= \\bfbeta + \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Z_{i}]\\gamma\n\\end{aligned}\n$$\nNotice that the vector in the second term is the linear projection coefficients of a population linear regression of $Z_i$ on the $\\X_i$. If we call these coefficients $\\bs{\\pi}$, then the short coefficients are \n$$ \n\\bs{\\delta} = \\bfbeta + \\bs{\\pi}\\gamma. 
\n$$\n\nWe can rewrite this to show that the difference between the coefficients in these two projections is $\\bs{\\delta} - \\bfbeta= \\bs{\\pi}\\gamma$ or the product of the coefficient on the \"excluded\" $Z_i$ and the coefficient of the included $\\X_i$ on the excluded. Most textbooks refer to this difference as the **omitted variable bias** of omitting $Z_i$ under the idea that $\\bfbeta$ is the true target of inference. But the result is much broader than this since it just tells us how to relate the coefficients of two nested projections. \n\n\nThe last two results (multiple regressions from bivariate and omitted variable bias) are sometimes presented as results for the ordinary least squares estimator that we will show in the next chapter. We introduce them here as features of a particular population quantity, the linear projection or population linear regression. \n\n\n## Drawbacks of the BLP\n\nThe best linear predictor is, of course, a *linear* approximation to the CEF, and this approximation could be quite poor if the true CEF is highly nonlinear. A more subtle issue with the BLP is that it is sensitive to the marginal distribution of the covariates when the CEF is nonlinear. Let's return to our example of voter wait times and income. In @fig-blp-limits, we show the true CEF and the BLP when we restrict income below \\$50,000 or above \\$100,000. The BLP can vary quite dramatically here. This figure is an extreme example, but the essential point will still hold as the marginal distribution of $X_i$ changes.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Linear projections for when truncating income distribution below $50k and above $100k.](06_linear_model_files/figure-pdf/fig-blp-limits-1.pdf){#fig-blp-limits}\n:::\n:::\n", + "markdown": "# Linear regression {#sec-regression}\n\n\nRegression is a tool for assessing the relationship between an **outcome variable**, $Y_i$, and a set of **covariates**, $\\X_i$. In particular, these tools show how the conditional mean of $Y_i$ varies as a function of $\\X_i$. For example, we may want to know how voting poll wait times vary as a function of some socioeconomic features of the precinct, like income and racial composition. We usually accomplish this task by estimating the **regression function** or **conditional expectation function** (CEF) of the outcome given the covariates, \n$$\n\\mu(\\bfx) = \\E[Y_i \\mid \\X_i = \\bfx].\n$$\nWhy are estimation and inference for this regression function special? Why can't we just use the approaches we have seen for the mean, variance, covariance, and so on? The fundamental problem with the CEF is that there may be many, many values $\\bfx$ that can occur and many different conditional expectations that we will need to estimate. If any variable in $\\X_i$ is continuous, we must estimate an infinite number of possible values of $\\mu(\\bfx)$. Because it worsens as we add covariates to $\\X_i$, we refer to this problem as the **curse of dimensionality**. How can we resolve this with our measly finite data?\n\nIn this chapter, we will explore two ways of \"solving\" the curse of dimensionality: assuming it away and changing the quantity of interest to something easier to estimate. \n\n\nRegression is so ubiquitous in many scientific fields that it has a lot of acquired notational baggage. 
In particular, the labels of the $Y_i$ and $\\X_i$ vary greatly:\n\n- The outcome can also be called: the response variable, the dependent variable, the labels (in machine learning), the left-hand side variable, or the regressand. \n- The covariates are also called: the explanatory variables, the independent variables, the predictors, the regressors, inputs, or features. \n\n\n## Why do we need models?\n\nAt first glance, the connection between the CEF and parametric models might be hazy. For example, imagine we are interested in estimating the average poll wait times ($Y_i$) for Black voters ($X_i = 1$) versus non-Black voters ($X_i=0$). In that case, there are two parameters to estimate, \n$$\n\\mu(1) = \\E[Y_i \\mid X_i = 1] \\quad \\text{and}\\quad \\mu(0) = \\E[Y_i \\mid X_i = 0],\n$$\nwhich we could estimate by using the plug-in estimators that replace the population averages with their sample counterparts,\n$$ \n\\widehat{\\mu}(1) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 1)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 1)} \\qquad \\widehat{\\mu}(0) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 0)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 0)}.\n$$\nThese are just the sample averages of the wait times for Black and non-Black voters, respectively. And because the race variable here is discrete, we are simply estimating sample means within subpopulations defined by race. The same logic would apply if we had $k$ racial categories: we would have $k$ conditional expectations to estimate and $k$ (conditional) sample means. \n\nNow imagine that we want to know how the average poll wait time varies as a function of income so that $X_i$ is (essentially) continuous. Now we have a different conditional expectation for every possible dollar amount from 0 to Bill Gates's income. Imagine we pick a particular income, \\$42,238, and so we are interested in the conditional expectation $\\mu(42,238)= \\E[Y_{i}\\mid X_{i} = 42,238]$. We could use the same plug-in estimator in the discrete case, \n$$\n\\widehat{\\mu}(42,238) = \\frac{\\sum_{i=1}^{n} Y_{i}\\mathbb{1}(X_{i} = 42,238)}{\\sum_{i=1}^{n}\\mathbb{1}(X_{i} = 42,238)}.\n$$\nWhat is the problem with this estimator? In all likelihood, no units in any particular dataset have that exact income, meaning this estimator is undefined (we would be dividing by zero). \n\n\nOne solution to this problem is to use **subclassification**, turn the continuous variable into a discrete one, and proceed with the discrete approach above. We might group incomes into \\$25,000 bins and then calculate the average wait times of anyone between, say, \\$25,000 and \\$50,000 income. When we make this estimator switch for practical purposes, we need to connect it back to the DGP of interest. We could **assume** that the CEF of interest only depends on these binned means, which would mean we have: \n$$\n\\mu(x) = \n\\begin{cases}\n \\E[Y_{i} \\mid 0 \\leq X_{i} < 25,000] &\\text{if } 0 \\leq x < 25,000 \\\\\n \\E[Y_{i} \\mid 25,000 \\leq X_{i} < 50,000] &\\text{if } 25,000 \\leq x < 50,000\\\\\n \\E[Y_{i} \\mid 50,000 \\leq X_{i} < 100,000] &\\text{if } 50,000 \\leq x < 100,000\\\\\n \\vdots \\\\\n \\E[Y_{i} \\mid 200,000 \\leq X_{i}] &\\text{if } 200,000 \\leq x\\\\\n\\end{cases}\n$$\nThis approach assumes, perhaps incorrectly, that the average wait time does not vary within the bins. @fig-cef-binned shows a hypothetical joint distribution between income and wait times with the true CEF, $\\mu(x)$, shown in red. 
The figure also shows the bins created by subclassification and the implied CEF if we assume bin-constant means in blue. We can see that the blue function approximates the true CEF but deviates from it close to the bin edges. The trade-off is that once we make the assumption, we only have to estimate one mean for every bin rather than an infinite number of means for each possible income. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of income and poll wait times (contour plot), conditional expectation function (red), and the conditional expectation of the binned income (blue).](06_linear_model_files/figure-pdf/fig-cef-binned-1.pdf){#fig-cef-binned}\n:::\n:::\n\n\n\nSimilarly, we could **assume** that the CEF follows a simple functional form like a line,\n$$ \n\\mu(x) = \\E[Y_{i}\\mid X_{i} = x] = \\beta_{0} + \\beta_{1} x.\n$$\nThis assumption reduces our infinite number of unknowns (the conditional mean at every possible income) to just two unknowns: the slope and intercept. As we will see, we can use the standard ordinary least squares to estimate these parameters. Notice again that if the true CEF is nonlinear, this assumption is incorrect, and any estimate based on this assumption might be biased or even inconsistent. \n\nWe call the binning and linear assumptions on $\\mu(x)$ **functional form** assumptions because they restrict the class of functions that $\\mu(x)$ can take. While powerful, these types of assumptions can muddy the roles of defining the quantity of interest and estimation. If our estimator $\\widehat{\\mu}(x)$ performs poorly, it will be difficult to tell if this is because the estimator is flawed or our functional form assumptions are incorrect. \n\nTo help clarify these issues, we will pursue a different approach: understanding what linear regression can estimate under minimal assumptions and then investigating how well this estimand approximates the true CEF. \n\n## Population linear regression {#sec-linear-projection}\n\n### Bivariate linear regression \n\n\nLet's set aside the idea of the conditional expectation function and instead focus on finding the **linear** function of a single covariate $X_i$ that best predicts the outcome. Remember that linear functions have the form $a + bX_i$. The **best linear predictor** (BLP) or **population linear regression** of $Y_i$ on $X_i$ is defined as\n$$ \nm(x) = \\beta_0 + \\beta_1 x \\quad\\text{where, }\\quad (\\beta_{0}, \\beta_{1}) = \\argmin_{(b_{0}, b_{1}) \\in \\mathbb{R}^{2}}\\; \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}].\n$$\nThat is, the best linear predictor is the line that results in the lowest mean-squared error predictions of the outcome given the covariates, averaging over the joint distribution of the data. This function is a feature of the joint distribution of the data---the DGP---and so represents something that we would like to learn about with our sample. It is an alternative to the CEF for summarizing the relationship between the outcome and the covariate, though we will see that they will sometimes be equal. We call $(\\beta_{0}, \\beta_{1})$ the **population linear regression coefficients**. Notice that $m(x)$ could differ greatly from the CEF $\\mu(x)$ if the latter is nonlinear. \n\nWe can solve for the best linear predictor using standard calculus (taking the derivative with respect to each coefficient, setting those equations equal to 0, and solving the system of equations). 
The first-order conditions, in this case, are\n$$ \n\\begin{aligned}\n    \\frac{\\partial \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}]}{\\partial b_{0}} = \\E[-2(Y_{i} - \\beta_{0} - \\beta_{1}X_{i})] = 0 \\\\\n    \\frac{\\partial \\E[(Y_{i} - b_{0} - b_{1}X_{i} )^{2}]}{\\partial b_{1}} = \\E[-2(Y_{i} - \\beta_{0} - \\beta_{1}X_{i})X_{i}] = 0\n\\end{aligned} \n$$\nGiven the linearity of expectations, it is easy to solve for $\\beta_0$ in terms of $\\beta_1$,\n$$ \n\\beta_{0} = \\E[Y_{i}] - \\beta_{1}\\E[X_{i}].\n$$\nWe can plug this into the first-order condition for $\\beta_1$ to get\n$$ \n\\begin{aligned}\n  0 &= \\E[Y_{i}X_{i}] - (\\E[Y_{i}] - \\beta_{1}\\E[X_{i}])\\E[X_{i}] - \\beta_{1}\\E[X_{i}^{2}] \\\\\n  &= \\E[Y_{i}X_{i}] - \\E[Y_{i}]\\E[X_{i}] - \\beta_{1}(\\E[X_{i}^{2}] - \\E[X_{i}]^{2}) \\\\\n  &= \\cov(X_{i},Y_{i}) - \\beta_{1}\\V[X_{i}]\\\\\n  \\beta_{1} &= \\frac{\\cov(X_{i},Y_{i})}{\\V[X_{i}]}\n\\end{aligned}\n$$\n\nThus the slope of the population linear regression of $Y_i$ on $X_i$ is equal to the covariance of the two variables divided by the variance of $X_i$. From this, we can immediately see that the covariance will determine the sign of the slope: positive covariances will lead to positive $\\beta_1$ and negative covariances will lead to negative $\\beta_1$. In addition, we can see that if $Y_i$ and $X_i$ are independent, $\\beta_1 = 0$. The slope scales this covariance by the variance of the covariate, so slopes are lower for more spread-out covariates and higher for less spread-out covariates. If we define the correlation between these variables as $\\rho_{YX}$, then we can relate the coefficient to this quantity as \n$$\n\\beta_1 = \\rho_{YX}\\sqrt{\\frac{\\V[Y_i]}{\\V[X_i]}}.\n$$\n\nCollecting together our results, we can write the population linear regression as \n$$\nm(x) = \\beta_0 + \\beta_1x = \\E[Y_i] + \\beta_1(x - \\E[X_i]),\n$$\nwhich shows how we adjust our best guess about $Y_i$ away from the mean of the outcome using the covariate. \n\nIt's important to remember that the BLP, $m(x)$, and the CEF, $\\mu(x)$, are distinct entities. If the CEF is nonlinear, as in @fig-cef-blp, there will be a difference between these functions, meaning that the BLP might produce subpar predictions. Below, we will derive a formal connection between the BLP and the CEF. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Comparison of the CEF and the best linear predictor.](06_linear_model_files/figure-pdf/fig-cef-blp-1.pdf){#fig-cef-blp}\n:::\n:::\n\n\n\n\n### Beyond linear approximations\n\nThe linear part of the best linear predictor is less restrictive than it might appear at first glance. We can easily modify the minimum MSE problem to find the best quadratic, cubic, or general polynomial function of $X_i$ that predicts $Y_i$. For example, the quadratic function of $X_i$ that best predicts $Y_i$ would be\n$$ \nm(X_i, X_i^2) = \\beta_0 + \\beta_1X_i + \\beta_2X_i^2 \\quad\\text{where}\\quad (\\beta_0, \\beta_1, \\beta_2) = \\argmin_{(b_0,b_1,b_2) \\in \\mathbb{R}^3}\\;\\E[(Y_{i} - b_{0} - b_{1}X_{i} - b_{2}X_{i}^{2})^{2}].\n$$\nThis equation is now a quadratic function of the covariates, but it is still a linear function of the unknown parameters $(\\beta_{0}, \\beta_{1}, \\beta_{2})$, so we will call this a best linear predictor. \n\nWe could include higher-order terms of $X_i$ in the same manner, and the more polynomial terms $X_i^p$ we include, the more flexible the function of $X_i$ the BLP can capture. 
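To see this growing flexibility in action, here is a small R sketch. Everything in it is hypothetical: the data are simulated, and the logistic-shaped conditional mean, noise level, and seed are chosen purely for illustration rather than taken from the running wait-times example.

```r
# Sketch: least squares fits of increasing polynomial order to a nonlinear CEF
# (simulated data; the true CEF exp(x)/(1 + exp(x)) is purely illustrative)
set.seed(1234)
n  <- 1000
x  <- rnorm(n)
mu <- exp(x) / (1 + exp(x))              # true (nonlinear) conditional mean
y  <- mu + rnorm(n, sd = 0.25)

fit1 <- lm(y ~ x)                        # best linear approximation
fit2 <- lm(y ~ x + I(x^2))               # best quadratic approximation
fit3 <- lm(y ~ poly(x, 3, raw = TRUE))   # best cubic approximation

# In-sample mean squared error shrinks as we add polynomial terms
sapply(list(linear = fit1, quadratic = fit2, cubic = fit3),
       function(f) mean(resid(f)^2))
```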
When we estimate the BLP, however, we usually will pay for this flexibility in terms of overfitting and high variance in our estimates. \n\n\n### Linear prediction with multiple covariates \n\nWe now generalize the idea of a best linear predictor to a setting with an arbitrary number of covariates. In this setting, remember that the linear function will be\n\n$$ \n\\bfx'\\bfbeta = x_{1}\\beta_{1} + x_{2}\\beta_{2} + \\cdots + x_{k}\\beta_{k}.\n$$\nWe will define the **best linear predictor** (BLP) to be\n$$ \nm(\\bfx) = \\bfx'\\bfbeta, \\quad \\text{where}\\quad \\bfbeta = \\argmin_{\\mb{b} \\in \\real^k}\\; \\E\\bigl[ \\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\\bigr]\n$$\n\nThis BLP solves the same fundamental optimization problem as in the bivariate case: it chooses the set of coefficients that minimizes the mean-squared error averaging over the joint distribution of the data. \n\n\n\n::: {.callout-note}\n## Best linear projection assumptions\n\nWithout some assumptions on the joint distribution of the data, the following \"regularity conditions\" will ensure the existence of the BLP:\n\n1. $\\E[Y^2] < \\infty$ (outcome has finite mean/variance)\n2. $\\E\\Vert \\mb{X} \\Vert^2 < \\infty$ ($\\mb{X}$ has finite means/variances/covariances)\n3. $\\mb{Q}_{\\mb{XX}} = \\E[\\mb{XX}']$ is positive definite (columns of $\\X$ are linearly independent) \n:::\n\nUnder these assumptions, it is possible to derive a closed-form expression for the **population coefficients** $\\bfbeta$ using matrix calculus. To set up the optimization problem, we will find the first-order condition by taking the derivative of the expectation of the squared errors. First, let's take the derivative of the squared prediction errors using the chain rule:\n$$ \n\\begin{aligned}\n \\frac{\\partial}{\\partial \\mb{b}}\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)^{2}\n &= 2\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)\\frac{\\partial}{\\partial \\mb{b}}(Y_{i} - \\X_{i}'\\mb{b}) \\\\\n &= -2\\left(Y_{i} - \\X_{i}'\\mb{b}\\right)\\X_{i} \\\\\n &= -2\\X_{i}\\left(Y_{i} - \\X_{i}'\\mb{b}\\right) \\\\\n &= -2\\left(\\X_{i}Y_{i} - \\X_{i}\\X_{i}'\\mb{b}\\right),\n\\end{aligned}\n$$\nwhere the third equality comes from the fact that $(Y_{i} - \\X_{i}'\\bfbeta)$ is a scalar. We can now plug this into the expectation to get the first-order condition and solve for $\\bfbeta$,\n$$ \n\\begin{aligned}\n 0 &= -2\\E[\\X_{i}Y_{i} - \\X_{i}\\X_{i}'\\bfbeta ] \\\\\n \\E[\\X_{i}\\X_{i}'] \\bfbeta &= \\E[\\X_{i}Y_{i}],\n\\end{aligned}\n$$\nwhich implies the population coefficients are\n$$ \n\\bfbeta = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] = \\mb{Q}_{\\mb{XX}}^{-1}\\mb{Q}_{\\mb{X}Y}\n$$\nWe now have an expression for the coefficients for the population best linear predictor in terms of the joint distribution $(Y_{i}, \\X_{i})$. A couple of facts might be useful for reasoning this expression. Recall that $\\mb{Q}_{\\mb{XX}} = \\E[\\X_{i}\\X_{i}']$ is a $k\\times k$ matrix and $\\mb{Q}_{\\X Y} = \\E[\\X_{i}Y_{i}]$ is a $k\\times 1$ column vector, which implies that $\\bfbeta$ is also a $k \\times 1$ column vector. \n\n::: {.callout-note}\n\nIntuitively, what is happening in the expression for the population regression coefficients? It is helpful to separate the intercept or constant term so that we have\n$$ \nY_{i} = \\beta_{0} + \\X'\\bfbeta + e_{i},\n$$\nso $\\bfbeta$ refers to just the vector of coefficients for the covariates. 
In this case, we can write the coefficients in a more interpretable way:\n$$ \n\\bfbeta = \\V[\\X]^{-1}\\text{Cov}(\\X, Y), \\qquad \\beta_0 = \\mu_Y - \\mb{\\mu}'_{\\mb{X}}\\bfbeta\n$$\n\nThus, the population coefficients take the covariance between the outcome and the covariates and \"divide\" it by information about variances and covariances of the covariates. The intercept recenters the regression so that projection errors are mean zero. Thus, we can see that these coefficients generalize the bivariate formula to this multiple covariate context. \n:::\n\nWith an expression for the population linear regression coefficients, we can write the linear projection as \n$$ \nm(\\X_{i}) = \\X_{i}'\\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] = \\X_{i}'\\mb{Q}_{\\mb{XX}}^{-1}\\mb{Q}_{\\mb{X}Y}\n$$\n\n\n\n### Projection error\n\nThe **projection error** is the difference between the actual value of $Y_i$ and the projection,\n$$ \ne_{i} = Y_{i} - m(\\X_{i}) = Y_i - \\X_{i}'\\bfbeta,\n$$\nwhere we have made no assumptions about this error yet. The projection error is simply the prediction error of the best linear prediction. Rewriting this definition, we can see that we can always write the outcome as the linear projection plus the projection error,\n$$ \nY_{i} = \\X_{i}'\\bfbeta + e_{i}.\n$$\nNotice that this looks suspiciously similar to a linearity assumption on the CEF, but we haven't made any assumptions here. Instead, we have just used the definition of the projection error to write a tautological statement: \n$$ \nY_{i} = \\X_{i}'\\bfbeta + e_{i} = \\X_{i}'\\bfbeta + Y_{i} - \\X_{i}'\\bfbeta = Y_{i}.\n$$\nThe critical difference between this representation and the usual linear model assumption is what properties $e_{i}$ possesses. \n\nOne key property of the projection errors is that when the covariate vector includes an \"intercept\" or constant term, the projection errors are uncorrelated with the covariates. To see this, we first note that $\\E[\\X_{i}e_{i}] = 0$ since\n$$ \n\\begin{aligned}\n \\E[\\X_{i}e_{i}] &= \\E[\\X_{{i}}(Y_{i} - \\X_{i}'\\bfbeta)] \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}\\X_{i}']\\bfbeta \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}\\X_{i}']\\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}] \\\\\n &= \\E[\\X_{i}Y_{i}] - \\E[\\X_{i}Y_{i}] = 0\n\\end{aligned}\n$$\nThus, for every $X_{ij}$ in $\\X_{i}$, we have $\\E[X_{ij}e_{i}] = 0$. If one of the entries in $\\X_i$ is a constant 1, then this also implies that $\\E[e_{i}] = 0$. Together, these facts imply that the projection error is uncorrelated with each $X_{ij}$, since\n$$ \n\\cov(X_{ij}, e_{i}) = \\E[X_{ij}e_{i}] - \\E[X_{ij}]\\E[e_{i}] = 0 - 0 = 0\n$$\nNotice that we still have made no assumptions about these projection errors except for some mild regularity conditions on the joint distribution of the outcome and covariates. Thus, in very general settings, we can write the linear projection model $Y_i = \\X_i'\\bfbeta + e_i$ where $\\bfbeta = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}]$ and conclude that $\\E[\\X_{i}e_{i}] = 0$ by definition, not by assumption. \n\nThe projection error is uncorrelated with the covariates, so does this mean that the CEF is linear? Unfortunately, no. Recall that while independence implies uncorrelated, the reverse does not hold. 
So when we look at the CEF, we have\n$$ \n\\E[Y_{i} \\mid \\X_{i}] = \\X_{i}'\\bfbeta + \\E[e_{i} \\mid \\X_{i}],\n$$\nand the last term $\\E[e_{i} \\mid \\X_{i}]$ would only be 0 if the errors were independent of the covariates, so $\\E[e_{i} \\mid \\X_{i}] = \\E[e_{i}] = 0$. But nowhere in the linear projection model did we assume this. So while we can (almost) always write the outcome as $Y_i = \\X_i'\\bfbeta + e_i$ and have those projection errors be uncorrelated with the covariates, it will require additional assumptions to ensure that the true CEF is, in fact, linear $\\E[Y_{i} \\mid \\X_{i}] = \\X_{i}'\\bfbeta$. \n\nLet's take a step back. What have we shown here? In a nutshell, we have shown that a population linear regression exists under very general conditions, and we can write the coefficients of that population linear regression as a function of expectations of the joint distribution of the data. We did not assume that the CEF was linear nor that the projection errors were normal. \n\n\nWhy do we care about this? The ordinary least squares estimator, the workhorse regression estimator, targets this quantity of interest in large samples, regardless of whether the true CEF is linear or not. Thus, even when a linear CEF assumption is incorrect, OLS still targets a perfectly valid quantity of interest: the coefficients from this population linear projection. \n\n## Linear CEFs without assumptions\n\nWhat is the relationship between the best linear predictor (which we just saw generally exists) and the CEF? To draw the connection, remember a vital property of the conditional expectation: it is the function of $\\X_i$ that best predicts $Y_{i}$. The population regression was the best **linear** predictor, but the CEF is the best predictor among all nicely behaved functions of $\\X_{i}$, linear or nonlinear. In particular, if we label $L_2$ to be the set of all functions of the covariates $g()$ that have finite squared expectation, $\\E[g(\\X_{i})^{2}] < \\infty$, then we can show that the CEF has the lowest squared prediction error in this class of functions:\n$$ \n\\mu(\\X) = \\E[Y_{i} \\mid \\X_{i}] = \\argmin_{g(\\X_i) \\in L_2}\\; \\E\\left[(Y_{i} - g(\\X_{i}))^{2}\\right],\n$$\n\nSo we have established that the CEF is the best predictor and the population linear regression $m(\\X_{i})$ is the best linear predictor. These two facts allow us to connect the CEF and the population regression.\n\n::: {#thm-cef-blp}\nIf $\\mu(\\X_{i})$ is a linear function of $\\X_i$, then $\\mu(\\X_{i}) = m(\\X_{i}) = \\X_i'\\bfbeta$. \n\n:::\n\nThis theorem says that if the true CEF is linear, it equals the population linear regression. The proof of this is straightforward: the CEF is the best predictor, so if it is linear, it must also be the best linear predictor. \n \n \nIn general, we are in the business of learning about the CEF, so we are unlikely to know if it genuinely is linear or not. In some situations, however, we can show that the CEF is linear without any additional assumptions. These will be situations when the covariates take on a finite number of possible values. Suppose we are interested in the CEF of poll wait times for Black ($X_i = 1$) vs. non-Black ($X_i = 0$) voters. In this case, there are two possible values of the CEF, $\\mu(1) = \\E[Y_{i}\\mid X_{i}= 1]$, the average wait time for Black voters, and $\\mu(0) = \\E[Y_{i}\\mid X_{i} = 0]$, the average wait time for non-Black voters. 
Notice that we can write the CEF as\n$$ \n\\mu(x) = x \\mu(1) + (1 - x) \\mu(0) = \\mu(0) + x\\left(\\mu(1) - \\mu(0)\\right)= \\beta_0 + x\\beta_1,\n$$\nwhich is clearly a linear function of $x$. Based on this derivation, we can see that the coefficients of this linear CEF have a clear interpretation:\n\n- $\\beta_0 = \\mu(0)$: the expected wait time for a non-Black voter. \n- $\\beta_1 = \\mu(1) - \\mu(0)$: the difference in average wait times between Black and non-Black voters. \n\nNotice that it matters how $X_{i}$ is defined here since the intercept will always be the average outcome when $X_i = 0$, and the slope will always be the difference in means between the $X_i = 1$ group and the $X_i = 0$ group. \n\nWhat about a categorical covariate with more than two levels? For instance, we might be interested in wait times by party identification, where $X_i = 1$ indicates Democratic voters, $X_i = 2$ indicates Republican voters, and $X_i = 3$ indicates independent voters. Writing the CEF as a linear function of this variable directly would assume that the difference between Democrats and Republicans is the same as the difference between Republicans and independents. With more than two levels, we can instead represent a categorical variable as a vector of binary variables, $\\X_i = (X_{i1}, X_{i2})$, where\n$$ \n\\begin{aligned}\n  X_{{i1}} &= \\begin{cases}\n    1&\\text{if Republican} \\\\\n    0 & \\text{if not Republican}\n  \\end{cases} \\\\\nX_{{i2}} &= \\begin{cases}\n    1&\\text{if independent} \\\\\n    0 & \\text{if not independent}\n  \\end{cases} \\\\\n\\end{aligned}\n$$\nThese two indicator variables encode the same information as the original three-level variable, $X_{i}$. If we know the values of $X_{i1}$ and $X_{i2}$, we know exactly to which party $i$ belongs. Thus, the CEFs for $X_i$ and the pair of indicator variables, $\\X_i$, are precisely the same, but the latter admits a lovely linear representation,\n$$\n\\E[Y_i \\mid X_{i1}, X_{i2}] = \\beta_0 + \\beta_1 X_{i1} + \\beta_2 X_{i2},\n$$\nwhere\n\n- $\\beta_0 = \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the average wait time for the group that does not get an indicator variable (Democrats in this case). \n- $\\beta_1 = \\E[Y_{i} \\mid X_{i1} = 1, X_{i2} = 0] - \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the difference in means between Republican voters and Democratic voters, or the difference between the first indicator group and the baseline group. \n- $\\beta_2 = \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 1] - \\E[Y_{i} \\mid X_{i1} = 0, X_{i2} = 0]$ is the difference in means between independent voters and Democratic voters, or the difference between the second indicator group and the baseline group.\n\nThis approach easily generalizes to categorical variables with an arbitrary number of levels. \n\nWhat have we shown? The CEF will be linear without additional assumptions when there is a categorical covariate. We can show that this continues to hold even when we have multiple categorical variables. Suppose we now have two binary covariates: $X_{i1}=1$ indicating a Black voter, and $X_{i2} = 1$ indicating an urban voter. 
With these two binary variables, there are four possible values of the CEF:\n$$ \n\\mu(x_1, x_2) = \\begin{cases} \n \\mu_{00} & \\text{if } x_1 = 0 \\text{ and } x_2 = 0 \\text{ (non-Black, rural)} \\\\\n \\mu_{10} & \\text{if } x_1 = 1 \\text{ and } x_2 = 0 \\text{ (Black, rural)} \\\\\n \\mu_{01} & \\text{if } x_1 = 0 \\text{ and } x_2 = 1 \\text{ (non-Black, urban)} \\\\\n \\mu_{11} & \\text{if } x_1 = 1 \\text{ and } x_2 = 1 \\text{ (Black, urban)}\n \\end{cases}\n$$\nWe can write this as\n$$ \n\\mu(x_{1}, x_{2}) = (1 - x_{1})(1 - x_{2})\\mu_{00} + x_{1}(1 -x_{2})\\mu_{10} + (1-x_{1})x_{2}\\mu_{01} + x_{1}x_{2}\\mu_{11},\n$$\nwhich we can rewrite as \n$$ \n\\mu(x_1, x_2) = \\beta_0 + x_1\\beta_1 + x_2\\beta_2 + x_1x_2\\beta_3,\n$$\nwhere\n\n- $\\beta_0 = \\mu_{00}$: average wait times for rural non-Black voters. \n- $\\beta_1 = \\mu_{10} - \\mu_{00}$: difference in means for rural Black vs. rural non-Black voters. \n- $\\beta_2 = \\mu_{01} - \\mu_{00}$: difference in means for urban non-Black vs. rural non-Black voters. \n- $\\beta_3 = (\\mu_{11} - \\mu_{01}) - (\\mu_{10} - \\mu_{00})$: difference in urban racial difference vs rural racial difference.\n\nThus, we can write the CEF with two binary covariates as linear when the linear specification includes a multiplicative interaction between them ($x_1x_2$). This result holds for all pairs of binary covariates, and we can generalize the interpretation of the coefficients in the CEF as\n\n- $\\beta_0 = \\mu_{00}$: average outcome when both variables are 0. \n- $\\beta_1 = \\mu_{10} - \\mu_{00}$: difference in average outcomes for the first covariate when the second covariate is 0. \n- $\\beta_2 = \\mu_{01} - \\mu_{00}$: difference in average outcomes for the second covariate when the first covariate is 0. \n- $\\beta_3 = (\\mu_{11} - \\mu_{01}) - (\\mu_{10} - \\mu_{00})$: change in the \"effect\" of the first (second) covariate when the second (first) covariate goes from 0 to 1. \n\nThis result also generalizes to an arbitrary number of binary covariates. If we have $p$ binary covariates, then the CEF will be linear with all two-way interactions, $x_1x_2$, all three-way interactions, $x_1x_2x_3$, up to the $p$-way interaction $x_1\\times\\cdots\\times x_p$. Furthermore, we can generalize to arbitrary numbers of categorical variables by expanding each into a series of binary variables and then including all interactions between the resulting binary variables. \n\n\nWe have established that when we have a set of categorical covariates, the true CEF will be linear, and we have seen the various ways to represent that CEF. Notice that when we use, for example, ordinary least squares, we are free to choose how to include our variables. That means that we could run a regression of $Y_i$ on $X_{i1}$ and $X_{i2}$ without an interaction term. This model will only be correct if $\\beta_3$ is equal to 0, and so the interaction term is irrelevant. Because of this ability to choose our models, it's helpful to have a language for models that capture the linear CEF appropriately. We call a model **saturated** if there are as many coefficients as the CEF's unique values. A saturated model, by its nature, can always be written as a linear function without assumptions. The above examples show how to construct saturated models in various situations.\n\n## Interpretation of the regression coefficients\n\nWe have seen how to interpret population regression coefficients when the CEF is linear without assumptions. 
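As a quick numerical check of the saturated-model result above, the following R sketch simulates two binary covariates and confirms that the regression with an interaction reproduces the four group means. The group means, sample size, and noise level below are all hypothetical.

```r
# Sketch: a saturated regression with two binary covariates recovers the four
# group means (simulated data; the implied means 10, 15, 12, 20 are hypothetical)
set.seed(42)
n  <- 10000
x1 <- rbinom(n, 1, 0.4)                   # e.g., Black voter indicator
x2 <- rbinom(n, 1, 0.5)                   # e.g., urban voter indicator
mu <- 10 + 5 * x1 + 2 * x2 + 3 * x1 * x2  # mu_00 = 10, mu_10 = 15, mu_01 = 12, mu_11 = 20
y  <- mu + rnorm(n, sd = 4)

fit <- lm(y ~ x1 * x2)           # saturated: intercept, x1, x2, and x1:x2
coef(fit)                        # estimates of beta_0, beta_1, beta_2, beta_3

tapply(y, list(x1, x2), mean)    # the four sample group means themselves
```

The fitted coefficients match the differences of these group means up to sampling noise.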
How do we interpret the population coefficients $\\bfbeta$ in other settings? \n\n\nLet's start with the simplest case, where every entry in $\\X_{i}$ represents a different covariate and no covariate is any function of another (we'll see why this caveat is necessary below). In this simple case, the $k$th coefficient, $\\beta_{k}$, will represent the change in the predicted outcome for a one-unit change in the $k$th covariate $X_{ik}$, holding all other covariates fixed. We can see this from \n$$ \n\\begin{aligned}\n m(x_{1} + 1, x_{2}) & = \\beta_{0} + \\beta_{1}(x_{1} + 1) + \\beta_{2}x_{2} \\\\\n m(x_{1}, x_{2}) &= \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{2},\n\\end{aligned} \n$$\nso that the change in the predicted outcome for increasing $X_{i1}$ by one unit is\n$$\n m(x_{1} + 1, x_{2}) - m(x_{1}, x_{2}) = \\beta_1\n$$\nNotice that nothing changes in this interpretation if we add more covariates to the vector,\n$$\n m(x_{1} + 1, \\bfx_{2}) - m(x_{1}, \\bfx_{2}) = \\beta_1,\n$$\nthe coefficient on a particular variable is the change in the predicted outcome for a one-unit change in the covariate holding all other covariates constant. Each coefficient summarizes the \"all else equal\" difference in the predicted outcome for each covariate. \n\n\n### Polynomial functions of the covariates\n\n\n\nThe interpretation of the population regression coefficients becomes more complicated when we include nonlinear functions of the covariates. In that case, multiple coefficients control how a change in a covariate will change the predicted value of $Y_i$. Suppose that we have a quadratic function of $X_{i1}$,\n$$ \nm(x_1, x_1^2, x_{2}) = \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{1}^{2} + \\beta_{3}x_{2},\n$$\nand try to look at a one-unit change in $x_1$,\n$$ \n\\begin{aligned}\n m(x_{1} + 1, (x_{1} + 1)^{2}, x_{2}) & = \\beta_{0} + \\beta_{1}(x_{1} + 1) + \\beta_{2}(x_{1} + 1)^{2}+ \\beta_{3}x_{2} \\\\\n m(x_{1}, x_{1}^{2}, x_{2}) &= \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{1}^{2} + \\beta_{3}x_{2},\n\\end{aligned} \n$$\nresulting in $\\beta_1 + \\beta_2(2x_{1} + 1)$. This formula might be an interesting quantity, but we will more commonly use the derivative of $m(\\bfx)$ with respect to $x_1$ as a measure of the marginal effect of $X_{i1}$ on the predicted value of $Y_i$ (holding all other variables constant), where \"marginal\" here means the change in prediction for a very small change in $X_{i1}$.[^effect] In the case of the quadratic covariate, we have\n$$ \n\\frac{\\partial m(x_{1}, x_{1}^{2}, x_{2})}{\\partial x_{1}} = \\beta_{1} + 2\\beta_{2}x_{1},\n$$\nso the marginal effect on prediction varies as a function of $x_1$. From this, we can see that the individual interpretations of the coefficients are less interesting: $\\beta_1$ is the marginal effect when $X_{i1} = 0$ and $\\beta_2 / 2$ describes how a one-unit change in $X_{i1}$ changes the marginal effect. As is hopefully clear, it will often be more straightforward to visualize the nonlinear predictor function (perhaps using the orthogonalization techniques in @sec-fwl). \n\n\n[^effect]: Notice the choice of language here. The marginal effect is on the predicted value of $Y_i$, not on $Y_i$ itself. So these marginal effects are associational, not necessarily causal quantities. 
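To make this derivative-based summary concrete, here is a small R sketch that fits a quadratic specification and evaluates the estimated marginal effect $\widehat{\beta}_1 + 2\widehat{\beta}_2 x_1$ at several values of $x_1$. The data-generating process, coefficient values, and evaluation grid are all hypothetical.

```r
# Sketch: marginal effect of x1 under a quadratic term, beta_1 + 2 * beta_2 * x1
# (simulated data; the coefficients and evaluation grid are purely illustrative)
set.seed(99)
n  <- 5000
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.05 * x1^2 + x2 + rnorm(n)

fit <- lm(y ~ x1 + I(x1^2) + x2)
b   <- coef(fit)

x1_vals <- c(0, 2.5, 5, 7.5, 10)          # hypothetical evaluation points
data.frame(x1              = x1_vals,
           marginal_effect = unname(b["x1"] + 2 * b["I(x1^2)"] * x1_vals))
```

Plotting these marginal effects against $x_1$ is usually more informative than inspecting the individual coefficients on their own.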
\n\n### Interactions\n\nAnother common nonlinear function of the covariates is when we include **interaction terms** or covariates that are products of two other covariates,\n$$ \nm(x_{1}, x_{2}, x_{1}x_{2}) = \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{2} + \\beta_{3}x_{1}x_{2}.\n$$\nIn these situations, we can also use the derivative of the BLP to measure the marginal effect of one variable or the other on the predicted value of $Y_i$. In particular, we have\n$$ \n\\begin{aligned}\n \\frac{\\partial m(x_{1}, x_{2}, x_{1}x_{2})}{\\partial x_1} &= \\beta_1 + \\beta_3x_2, \\\\\n \\frac{\\partial m(x_{1}, x_{2}, x_{1}x_{2})}{\\partial x_2} &= \\beta_2 + \\beta_3x_1.\n\\end{aligned}\n$$\nHere, the coefficients are slightly more interpretable:\n\n* $\\beta_1$: the marginal effect of $X_{i1}$ on predicted $Y_i$ when $X_{i2} = 0$.\n* $\\beta_2$: the marginal effect of $X_{i2}$ on predicted $Y_i$ when $X_{i1} = 0$.\n* $\\beta_3$: the change in the marginal effect of $X_{i1}$ due to a one-unit change in $X_{i2}$ **OR** the change in the marginal effect of $X_{i2}$ due to a one-unit change in $X_{i1}$.\n\nIf we add more covariates to this BLP, these interpretations change to \"holding all other covariates constant.\"\n\nInteractions are a routine part of social science research because they allow us to assess how the relationship between the outcome and an independent variable varies by the values of another variable. In the context of our study of voter wait times, if $X_{i1}$ is income and $X_{i2}$ is the Black/non-Black voter indicator, then $\\beta_3$ represents the change in the slope of the wait time-income relationship between Black and non-Black voters. \n\n\n## Multiple regression from bivariate regression {#sec-fwl}\n\nWhen we have a regression of an outcome on two covariates, it is helpful to understand how the coefficients of one variable relate to the other. For example, if we have the following best linear projection:\n$$ \n(\\alpha, \\beta, \\gamma) = \\argmin_{(a,b,c) \\in \\mathbb{R}^{3}} \\; \\E[(Y_{i} - (a + bX_{i} + cZ_{i}))^{2}]\n$$ {#eq-two-var-blp}\nIs there some way to understand the $\\beta$ coefficient here regarding simple linear regression? As it turns out, yes. From the above results, we know that the intercept has a simple form:\n$$\n\\alpha = \\E[Y_i] - \\beta\\E[X_i] - \\gamma\\E[Z_i].\n$$\nLet's investigate the first order condition for $\\beta$:\n$$ \n\\begin{aligned}\n 0 &= \\E[Y_{i}X_{i}] - \\alpha\\E[X_{i}] - \\beta\\E[X_{i}^{2}] - \\gamma\\E[X_{i}Z_{i}] \\\\\n &= \\E[Y_{i}X_{i}] - \\E[Y_{i}]\\E[X_{i}] + \\beta\\E[X_{i}]^{2} + \\gamma\\E[X_{i}]\\E[Z_{i}] - \\beta\\E[X_{i}^{2}] - \\gamma\\E[X_{i}Z_{i}] \\\\\n &= \\cov(Y, X) - \\beta\\V[X_{i}] - \\gamma \\cov(X_{i}, Z_{i})\n\\end{aligned}\n$$\nWe can see from this that if $\\cov(X_{i}, Z_{i}) = 0$, then the coefficient on $X_i$ will be the same as in the simple regression case, $\\cov(Y_{i}, X_{i})/\\V[X_{i}]$. When $X_i$ and $Z_i$ are uncorrelated, we sometimes call them **orthogonal**. 
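Before handling the general case, a small R sketch can illustrate the orthogonal case: when the two covariates are generated independently, the coefficient on $X_i$ from the two-covariate regression essentially matches the simple regression slope $\cov(Y_i, X_i)/\V[X_i]$. The data and coefficients below are simulated and hypothetical, and in a finite sample the covariates are only approximately uncorrelated, so the estimates agree only approximately.

```r
# Sketch: with (approximately) uncorrelated covariates, the multiple-regression
# coefficient on x is close to the simple regression slope cov(y, x) / var(x)
# (simulated data; the coefficients are purely illustrative)
set.seed(7)
n <- 100000
x <- rnorm(n)
z <- rnorm(n)                    # drawn independently of x
y <- 2 + 1.5 * x - 0.5 * z + rnorm(n)

coef(lm(y ~ x))["x"]             # simple regression slope
coef(lm(y ~ x + z))["x"]         # coefficient on x, controlling for z
cov(y, x) / var(x)               # plug-in version of cov(Y, X) / V[X]
```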
\n\nTo write a simple formula for $\\beta$ when the covariates are not orthogonal, we will **orthogonalize** $X_i$ by obtaining the prediction errors from a population linear regression of $X_i$ on $Z_i$:\n$$ \n\\widetilde{X}_{i} = X_{i} - (\\delta_{0} + \\delta_{1}Z_{i}) \\quad\\text{where}\\quad (\\delta_{0}, \\delta_{1}) = \\argmin_{(d_{0},d_{1}) \\in \\mathbb{R}^{2}} \\; \\E[(X_{i} - (d_{0} + d_{1}Z_{i}))^{2}]\n$$\nGiven the properties of projection errors, we know that this orthogonalized version of $X_{i}$ will be uncorrelated with $Z_{i}$ since $\\E[\\widetilde{X}_{i}Z_{i}] = 0$. Remarkably, the coefficient on $X_i$ from the \"long\" BLP in @eq-two-var-blp is the same as the regression of $Y_i$ on this orthogonalized $\\widetilde{X}_i$, \n$$ \n\\beta = \\frac{\\text{cov}(Y_{i}, \\widetilde{X}_{i})}{\\V[\\widetilde{X}_{i}]}\n$$\n\nWe can expand this idea to when there are several other covariates. Suppose now that we are interested in a regression of $Y_i$ on $\\X_i$ and we are interested in the coefficient on the $k$th covariate. Let $\\X_{i,-k}$ be the vector of covariates omitting the $k$th entry and let $m_k(\\X_{i,-k})$ represent the BLP of $X_{ik}$ on these other covariates. We can define $\\widetilde{X}_{ik} = X_{ik} - m_{k}(\\X_{i,-k})$ as the $k$th variable orthogonalized with respect to the rest of the variables and we can write the coefficient on $X_{ik}$ as\n$$ \n\\beta_k = \\frac{\\cov(Y_i, \\widetilde{X}_{ik})}{\\V[\\widetilde{X}_{ik}]}.\n$$ \nThus, the population regression coefficient in the BLP is the same as from a bivariate regression of the outcome on the projection error for $X_{ik}$ projected on all other covariates. One interpretation of coefficients in a population multiple regression is they represent the relationship between the outcome and the covariate after removing the linear relationships of all other variables. \n\n\n## Omitted variable bias\n\nIn many situations, we might need to choose whether to include a variable in a regression or not, so it can be helpful to understand how this choice might affect the population coefficients on the other variables in the regression. Suppose we have a variable $Z_i$ that we may add to our regression which currently has $\\X_i$ as the covariates. We can write this new projection as \n$$ \nm(\\X_i, Z_i) = \\X_i'\\bfbeta + Z_i\\gamma, \\qquad m(\\X_{i}) = \\X_i'\\bs{\\delta},\n$$\nwhere we often refer to $m(\\X_i, Z_i)$ as the long regression and $m(\\X_i)$ as the short regression. \n\nWe know from the definition of the BLP that we can write the short coefficients as \n$$ \n\\bs{\\delta} = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1} \\E[\\X_{i}Y_{i}].\n$$\nLetting $e_i = Y_i - m(\\X_{i}, Z_{i})$ be the projection errors from the long regression, we can write this as\n$$ \n\\begin{aligned}\n \\bs{\\delta} &= \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1} \\E[\\X_{i}(\\X_{i}'\\bfbeta + Z_{i}\\gamma + e_{i})] \\\\\n &= \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}(\\E[\\X_{i}\\X_{i}']\\bfbeta + \\E[\\X_{i}Z_{i}]\\gamma + \\E[\\X_{i}e_{i}]) \\\\\n &= \\bfbeta + \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Z_{i}]\\gamma\n\\end{aligned}\n$$\nNotice that the vector in the second term is the linear projection coefficients of a population linear regression of $Z_i$ on the $\\X_i$. If we call these coefficients $\\bs{\\pi}$, then the short coefficients are \n$$ \n\\bs{\\delta} = \\bfbeta + \\bs{\\pi}\\gamma. 
\n$$\n\nWe can rewrite this to show that the difference between the coefficients in these two projections is $\\bs{\\delta} - \\bfbeta= \\bs{\\pi}\\gamma$ or the product of the coefficient on the \"excluded\" $Z_i$ and the coefficient of the included $\\X_i$ on the excluded. Most textbooks refer to this difference as the **omitted variable bias** of omitting $Z_i$ under the idea that $\\bfbeta$ is the true target of inference. But the result is much broader than this since it just tells us how to relate the coefficients of two nested projections. \n\n\nThe last two results (multiple regressions from bivariate and omitted variable bias) are sometimes presented as results for the ordinary least squares estimator that we will show in the next chapter. We introduce them here as features of a particular population quantity, the linear projection or population linear regression. \n\n\n## Drawbacks of the BLP\n\nThe best linear predictor is, of course, a *linear* approximation to the CEF, and this approximation could be quite poor if the true CEF is highly nonlinear. A more subtle issue with the BLP is that it is sensitive to the marginal distribution of the covariates when the CEF is nonlinear. Let's return to our example of voter wait times and income. In @fig-blp-limits, we show the true CEF and the BLP when we restrict income below \\$50,000 or above \\$100,000. The BLP can vary quite dramatically here. This figure is an extreme example, but the essential point will still hold as the marginal distribution of $X_i$ changes.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Linear projections for when truncating income distribution below $50k and above $100k.](06_linear_model_files/figure-pdf/fig-blp-limits-1.pdf){#fig-blp-limits}\n:::\n:::\n", "supporting": [ "06_linear_model_files/figure-pdf" ], diff --git a/_freeze/06_linear_model/figure-pdf/fig-blp-limits-1.pdf b/_freeze/06_linear_model/figure-pdf/fig-blp-limits-1.pdf index 9ff9becaea1bd7f9bafc3a15a62b4c9bde180952..f1639681a45ebb8e62d55047a0a8505ff76778fb 100644 GIT binary patch delta 394 zcmeyU`$czxGNai<6+>xf)>vWISZD6mCoy`p=VP2%r`N1=&|`=yW0Rej8_j66@nk+5 zhoP~dsgb#%>0}3X5jbaa47&&~qv_;A5e>N5) zu7Rn#fq^=1ZH5MukBdYK7%6Ckq(&(CWR|5W08KTVY#=JmXgt|N)L7F{!2kpl@)Wqh z3MbnMr3@_$F_arxnwp`jGcqtW!W6T>RA+2tj;;G@BeDW*28+Y;0=g>}>4f?CNH0?qXxf)>vWISZD6mCoy`p=VP2%r`N1=&|`=yW0Rej8^vU5vGL?w zRTV>HLsKJjLlaFdec${Pm&B4(1q~M~BLgF2LqnL{<}S61yo{!kV@xzSU}BSdOr#N< z^(K)TrV1J%sSyf3nPsU8ApM#O*#(Iu8O5oI3T3H9#hLke3TAr7dZv>NOpP^76bwK> zAy0t|%m5l|X@MqYU|?u$FuBK6z1|pI#?su<5<|?=#1Ku)(7?bL-Bh5Mg&BrApiP)! 
z=1@E05Y99(GQx1WfsrY?V+;*U4KN&IY;1&KiLtS{!Q=^McFvAYjxNroh6X0)rlyV- quEwsGmM+FFh8E_oW=_UNPA+x|HiVQ+7Bm;;G&bZ?Rdw}u;{pJZlXqVL delta 504 zcmdmxw>EErGNbWC6+;PumDw2!C(Jp$^!?226%Qtz7d^qs5b}cc&BTHzCNrarXXdJM z7?>KG7@8VbPA*gvfpa$Zs$Jw|G@2Z5q5&71+;1X{;A}F9)G$)e2uY1l@X0JoRWLFz zGSpPaE=VlNC{9gOC`&CW&dkqKFw-;EGoEa0YOHCfU;qLNc?w)$hM}>cr74=2fq|is z`Q(06b#()D8B23ZGYm1y$sbKc>W$HjF*Go+Fh)~s2r?Q|%pBcNLjxlt45u0xnPRFl zv9ti2r;O|;V?zu}jE&9ACr>uBi#IfIb8|LxG%#^8G%#~9b#gPbbaFLtGc$H`adR;@ mb+uElA*du)!Oo7WxFoTtq@pM_jmyy3%*cXERn^tsjSB#wTYDz} diff --git a/_freeze/06_linear_model/figure-pdf/fig-cef-blp-1.pdf b/_freeze/06_linear_model/figure-pdf/fig-cef-blp-1.pdf index d445a4834e4e108df4563af9c47a7e5e9bab7e6e..fc6358939d4f141f795aa35a121937cf36dbe5ce 100644 GIT binary patch delta 509 zcmeCros>I4nbB;bilMYKYpgJ9tTT7(lNi0)^D)k>(`(i_=rKfM3YP3H$TNCu_RSN!^O(Tz{uFp5GJ?TP;CJ(qv_;J#v0mCF;iUwQ*{Fa zbv)WkC##x73YaQrgrr6&_+*x)Du7I#+-D*VG-{cNv8IWF0SGAMDR6-qhQMws=E5SaRC6r9CTLz delta 514 zcmbP~+n+l@nbTMS3MQ}9P@L$?AtA6bJ7eL5Ij5JtpP9Yl!KCw|Cs-LmUa-EISP;!< zw(*RMs)~WBp^2fXfu$ywzHfetOJYf?f`*Hgk%5u1p&?9evx(XQUPhzISB*8aVPd)l zrs@U;>iD!x)-Z__FjCM6NsUnO$t+7%0Getzd7_Cpqw(aGCdQhE3I-sckf*=}W*8b9 zn3|)B85kHEnNMalRZlfSm$5WAz))>zVPc7)&eFsTQ_Ru?P0Y~1z}RB)L{s5FV+1JEF=In>blVM$jf^ZOe>Js>cQ$u+bvCzfGjcODH*|D%bF(xwGITU`vM@Gw pa&&cZvQw}js3caw&W@|NB(bQZq9`?u%h1@w(tt}<)z#mP3jmDOdoTb1 From 7596ec078bf3e2ec95e225e472fb897a7e1ea91f Mon Sep 17 00:00:00 2001 From: Matt Blackwell Date: Mon, 20 Nov 2023 12:27:34 -0500 Subject: [PATCH 2/3] ch 6 typos fixes #38 fixes #39 fixes #41 fixes #42 fixes #44 fixes #45 --- 07_least_squares.qmd | 12 ++++++------ .../execute-results/html.json | 4 ++-- .../07_least_squares/execute-results/tex.json | 4 ++-- .../figure-pdf/fig-ajr-scatter-1.pdf | Bin 11905 -> 11905 bytes .../figure-pdf/fig-influence-1.pdf | Bin 11250 -> 11250 bytes .../figure-pdf/fig-outlier-1.pdf | Bin 11258 -> 11258 bytes .../figure-pdf/fig-ssr-comp-1.pdf | Bin 13324 -> 13324 bytes .../figure-pdf/fig-ssr-vs-tss-1.pdf | Bin 19857 -> 19857 bytes 8 files changed, 10 insertions(+), 10 deletions(-) diff --git a/07_least_squares.qmd b/07_least_squares.qmd index 88690e9..2c79639 100644 --- a/07_least_squares.qmd +++ b/07_least_squares.qmd @@ -284,7 +284,7 @@ $$ ## Rank, linear independence, and multicollinearity {#sec-rank} -When introducing the OLS estimator, we noted that it would exist when $\sum_{i=1}^n \X_i\X_i'$ is positive definite or that there is "no multicollinearity." This assumption is equivalent to saying that the matrix $\mathbb{X}$ is full column rank, meaning that $\text{rank}(\mathbb{X}) = (k+1)$, where $k+1$ is the number of columns of $\mathbb{X}$. Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and **linear independence** means that if $\mathbb{X}\mb{b} = 0$ if and only if $\mb{b}$ is a column vector of 0s. In other words, we have +When introducing the OLS estimator, we noted that it would exist when $\sum_{i=1}^n \X_i\X_i'$ is positive definite or that there is "no multicollinearity." This assumption is equivalent to saying that the matrix $\mathbb{X}$ is full column rank, meaning that $\text{rank}(\mathbb{X}) = (k+1)$, where $k+1$ is the number of columns of $\mathbb{X}$. Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and **linear independence** means that $\mathbb{X}\mb{b} = 0$ if and only if $\mb{b}$ is a column vector of 0s. 
In other words, we have $$ b_{1}\mathbb{X}_{1} + b_{2}\mathbb{X}_{2} + \cdots + b_{k+1}\mathbb{X}_{k+1} = 0 \quad\iff\quad b_{1} = b_{2} = \cdots = b_{k+1} = 0, $$ @@ -299,7 +299,7 @@ $$ $$ In this case, this expression equals 0 when $b_3 = b_4 = \cdots = b_{k+1} = 0$ and $b_1 = -2b_2$. Thus, the collection of columns is linearly dependent, so we know that the rank of $\mathbb{X}$ must be less than full column rank (that is, less than $k+1$). Hopefully, it is also clear that if we removed the problematic column $\mathbb{X}_2$, the resulting matrix would have $k$ linearly independent columns, implying that $\mathbb{X}$ is rank $k$. -Why does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that $\Xmat$ if of full column rank if and only if $\Xmat'\Xmat$ is non-singular and a matrix is invertible if and only if it is non-singular. Thus, the columns of $\Xmat$ being linearly independent means that the inverse $(\Xmat'\Xmat)^{-1}$ exists and so does $\bhat$. Further, this full rank condition also implies that $\Xmat'\Xmat = \sum_{i=1}^{n}\X_{i}\X_{i}'$ is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals. +Why does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that $\Xmat$ is of full column rank if and only if $\Xmat'\Xmat$ is non-singular and a matrix is invertible if and only if it is non-singular. Thus, the columns of $\Xmat$ being linearly independent means that the inverse $(\Xmat'\Xmat)^{-1}$ exists and so does $\bhat$. Further, this full rank condition also implies that $\Xmat'\Xmat = \sum_{i=1}^{n}\X_{i}\X_{i}'$ is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals. What are common situations that lead to violations of no multicollinearity? We have seen one above, with one variable being a linear function of another. But this problem can come out in more subtle ways. Suppose that we have a set of dummy variables corresponding to a single categorical variable, like the region of the country. In the US, this might mean we have $X_{i1} = 1$ for units in the West (0 otherwise), $X_{i2} = 1$ for units in the Midwest (0 otherwise), $X_{i3} = 1$ for units in the South (0 otherwise), and $X_{i4} = 1$ for units in the Northeast (0 otherwise). Each unit has to be in one of these four regions, so there is a linear dependence between these variables, $$ @@ -333,7 +333,7 @@ Note that these interpretations only hold when the regression consists solely of OLS has a very nice geometric interpretation that can add a lot of intuition for various aspects of the method. In this geometric approach, we view $\mb{Y}$ as an $n$-dimensional vector in $\mathbb{R}^n$. As we saw above, OLS in matrix form is about finding a linear combination of the covariate matrix $\Xmat$ closest to this vector in terms of the Euclidean distance (which is just the sum of squares). -Let $\mathcal{C}(\Xmat) = \{\Xmat\mb{b} : \mb{b} \in \mathbb{R}^2\}$ be the **column space** of the matrix $\Xmat$. This set is all linear combinations of the columns of $\Xmat$ or the set of all possible linear predictions we could obtain from $\Xmat$. Notice that the OLS fitted values, $\Xmat\bhat$, are in this column space. If, as we assume, $\Xmat$ has full column rank of $k+1$, then the column space $\mathcal{C}(\Xmat)$ will be a $k+1$-dimensional surface inside of the larger $n$-dimensional space. 
If $\Xmat$ has two columns, the column space will be a plane.
+Let $\mathcal{C}(\Xmat) = \{\Xmat\mb{b} : \mb{b} \in \mathbb{R}^{k+1}\}$ be the **column space** of the matrix $\Xmat$. This set is all linear combinations of the columns of $\Xmat$ or the set of all possible linear predictions we could obtain from $\Xmat$. Notice that the OLS fitted values, $\Xmat\bhat$, are in this column space. If, as we assume, $\Xmat$ has full column rank of $k+1$, then the column space $\mathcal{C}(\Xmat)$ will be a $k+1$-dimensional surface inside of the larger $n$-dimensional space. If $\Xmat$ has two columns, the column space will be a plane.
 
 Another interpretation of the OLS estimator is that it finds the linear predictor as the closest point in the column space of $\Xmat$ to the outcome vector $\mb{Y}$. This is called the **projection** of $\mb{Y}$ onto $\mathcal{C}(\Xmat)$. @fig-projection shows this projection for a case with $n=3$ and 2 columns in $\Xmat$. The shaded blue region represents the plane of the column space of $\Xmat$, and we can see that $\Xmat\bhat$ is the closest point to $\mb{Y}$ in that space. That's the whole idea of the OLS estimator: find the linear combination of the columns of $\Xmat$ (a point in the column space) that minimizes the Euclidean distance between that point and the outcome vector (the sum of squared residuals).
@@ -431,7 +431,7 @@ The residual regression approach is:
 
 1. Use OLS to regress $\mb{Y}$ on $\Xmat_2$ and obtain residuals $\widetilde{\mb{e}}_2$.
 2. Use OLS to regress each column of $\Xmat_1$ on $\Xmat_2$ and obtain residuals $\widetilde{\Xmat}_1$.
-3. Use OLS to regression $\widetilde{\mb{e}}_{2}$ on $\widetilde{\Xmat}_1$.
+3. Use OLS to regress $\widetilde{\mb{e}}_{2}$ on $\widetilde{\Xmat}_1$.
 
 :::
 
@@ -469,7 +469,7 @@ h_{ii} = \X_{i}'\left(\Xmat'\Xmat\right)^{-1}\X_{i},
 $$
 which is the $i$th diagonal entry of the projection matrix, $\mb{P}_{\Xmat}$. Notice that
 $$
-\widehat{\mb{Y}} = \mb{P}\mb{Y} \qquad \implies \qquad \widehat{Y}_i = \sum_{j=1}^n h_{ij}Y_j,
+\widehat{\mb{Y}} = \mb{P}_{\Xmat}\mb{Y} \qquad \implies \qquad \widehat{Y}_i = \sum_{j=1}^n h_{ij}Y_j,
 $$
 so that $h_{ij}$ is the importance of observation $j$ for the fitted value for observation $i$. The leverage, then, is the importance of the observation for its own fitted value. We can also interpret these values in terms of the distribution of $\X_{i}$. Roughly speaking, these values are the weighted distance $\X_i$ is from $\overline{\X}$, where the weights normalize to the empirical variance/covariance structure of the covariates (so that the scale of each covariate is roughly the same). 
We can see this most clearly when we fit a simple linear regression (with one covariate and an intercept) with OLS when the leverage is $$ @@ -545,7 +545,7 @@ text(5, 2, "Full sample", pos = 2, col = "dodgerblue") text(7, 7, "Influence Point", pos = 1, col = "indianred") ``` -One measure of influence is called DFBETA$_i$ measures how much $i$ changes the estimated coefficient vector +One measure of influence, called DFBETA$_i$, measures how much $i$ changes the estimated coefficient vector $$ \bhat - \bhat_{(-i)} = \left( \Xmat'\Xmat\right)^{-1}\X_i\widetilde{e}_i, $$ diff --git a/_freeze/07_least_squares/execute-results/html.json b/_freeze/07_least_squares/execute-results/html.json index 67a9619..ade23be 100644 --- a/_freeze/07_least_squares/execute-results/html.json +++ b/_freeze/07_least_squares/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "04505ae479cd88516932111dfa804a74", + "hash": "90f4eadf59de9404c076aba8c58ea089", "result": { - "markdown": "# The mechanics of least squares {#sec-ols-mechanics}\n\nThis chapter explores the most widely used estimator for population linear regressions: **ordinary least squares** (OLS). OLS is a plug-in estimator for the best linear projection (or population linear regression) described in the last chapter. Its popularity is partly due to its ease of interpretation, computational simplicity, and statistical efficiency. \n\nIn this chapter, we focus on motivating the estimator and the mechanical or algebraic properties of the OLS estimator. In the next chapter, we will investigate its statistical assumptions. Textbooks often introduce OLS under an assumption of a linear model for the conditional expectation, but this is unnecessary if we view the inference target as the best linear predictor. We discuss this point more fully in the next chapter. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Relationship between political institutions and economic development from Acemoglu, Johnson, and Robinson (2001).](07_least_squares_files/figure-html/fig-ajr-scatter-1.png){#fig-ajr-scatter width=672}\n:::\n:::\n\n\n\n\n## Deriving the OLS estimator \n\nIn the last chapter on the linear model and the best linear projection, we operated purely in the population, not samples. We derived the population regression coefficients $\\bfbeta$, representing the coefficients on the line of best fit in the population. We now take these as our quantity of interest. \n\n::: {.callout-note}\n## Assumption\n\n\nThe variables $\\{(Y_1, \\X_1), \\ldots, (Y_i,\\X_i), \\ldots, (Y_n, \\X_n)\\}$ are i.i.d. draws from a common distribution $F$.\n\n:::\n\nRecall the population linear coefficients (or best linear predictor coefficients) that we derived in the last chapter,\n$$ \n\\bfbeta = \\argmin_{\\mb{b} \\in \\real^k}\\; \\E\\bigl[ \\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\\bigr] = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}]\n$$\n\nWe will consider two different ways to derive the OLS estimator for these coefficients, both of which are versions of the plug-in principle. The first approach is to use the closed-form representation of the coefficients and replace any expectations with sample means,\n$$ \n\\bhat = \\left(\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\frac{1}{n} \\sum_{i=1}^n \\X_{i}Y_{i} \\right),\n$$\nwhich exists if $\\sum_{i=1}^n \\X_i\\X_i'$ is **positive definite** and thus invertible. We will return to this assumption below. 
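To make the plug-in formula concrete, here is a minimal sketch in R using simulated data (the data, sample size, and variable names are hypothetical, not from the text). It builds a design matrix with a constant, applies the closed-form expression directly, and compares the result to `lm()`; the 1/n terms cancel, so they are omitted.

```r
## Minimal sketch: the plug-in / least squares estimator by hand (simulated data)
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)   # n x (k+1) matrix with a constant in the first column

## bhat = (sum_i X_i X_i')^{-1} (sum_i X_i Y_i); solve(A, b) avoids forming the inverse
bhat <- solve(crossprod(X), crossprod(X, y))

## The same coefficients from R's built-in OLS routine
cbind(by_hand = drop(bhat), lm = coef(lm(y ~ x1 + x2)))
```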
\n\n\nIn a simple bivariate linear projection model $m(X_{i}) = \\beta_0 + \\beta_1X_{i}$, we saw that the population slope was $\\beta_1= \\text{cov}(Y_{i},X_{i})/ \\V[X_{i}]$ and this approach would have our estimator for the slope be the ratio of the sample covariance of $Y_i$ and $X_i$ to the sample variance of $X_i$, or\n$$ \n\\widehat{\\beta}_{1} = \\frac{\\widehat{\\sigma}_{Y,X}}{\\widehat{\\sigma}^{2}_{X}} = \\frac{ \\frac{1}{n-1}\\sum_{i=1}^{n} (Y_{i} - \\overline{Y})(X_{i} - \\overline{X})}{\\frac{1}{n-1} \\sum_{i=1}^{n} (X_{i} - \\Xbar)^{2}}.\n$$\n\nThis plug-in approach is widely applicable and tends to have excellent properties in large samples under iid data. But this approach also hides some of the geometry of the setting. \n\nThe second approach applies the plug-in principle not to the closed-form expression for the coefficients but to the optimization problem itself. We call this the **least squares** estimator because it minimizes the empirical (or sample) squared prediction error,\n$$ \n\\bhat = \\argmin_{\\mb{b} \\in \\real^k}\\; \\frac{1}{n} \\sum_{i=1}^{n}\\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2 = \\argmin_{\\mb{b} \\in \\real^k}\\; SSR(\\mb{b}),\n$$\nwhere,\n$$ \nSSR(\\mb{b}) = \\sum_{i=1}^{n}\\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\n$$\nis the sum of the squared residuals. To distinguish it from other, more complicated least squares estimators, we call this the **ordinary least squares** estimator or OLS. \n\nLet's solve this minimization problem! We can write down the first-order conditions as\n$$ \n0=\\frac{\\partial SSR(\\bhat)}{\\partial \\bfbeta} = 2 \\left(\\sum_{i=1}^{n} \\X_{i}Y_{i}\\right) - 2\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right)\\bhat.\n$$\nWe can rearrange this system of equations to\n$$ \n\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right)\\bhat = \\left(\\sum_{i=1}^{n} \\X_{i}Y_{i}\\right).\n$$\nTo obtain the solution for $\\bhat$, notice that $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is a $(k+1) \\times (k+1)$ matrix and $\\bhat$ and $\\sum_{i=1}^{n} \\X_{i}Y_{i}$ are both $k+1$ length column vectors. If $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is invertible, then we can multiply both sides of this equation by that inverse to arrive at\n$$ \n\\bhat = \\left(\\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\sum_{i=1}^n \\X_{i}Y_{i} \\right),\n$$\nwhich is the same expression as the plug-in estimator (after canceling the $1/n$ terms). To confirm that we have found a minimum, we also need to check the second-order condition, \n$$ \n \\frac{\\partial^{2} SSR(\\bhat)}{\\partial \\bfbeta\\bfbeta'} = 2\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right) > 0.\n$$\nWhat does it mean for a matrix to be \"positive\"? In matrix algebra, this condition means that the matrix $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is **positive definite**, a condition that we discuss in @sec-rank. \n\n\nUsing the plug-in or least squares approaches, we arrive at the same estimator for the best linear predictor/population linear regression coefficients.\n\n::: {#thm-ols}\n\nIf the $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is positive definite, then the ordinary least squares estimator is\n$$\n\\bhat = \\left(\\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\sum_{i=1}^n \\X_{i}Y_{i} \\right).\n$$\n\n:::\n\n\n::: {.callout-note}\n\n## Formula for the OLS slopes\n\nAlmost all regression will contain an intercept term usually represented as a constant 1 in the covariate vector. 
It is also possible to separate the intercept to arrive at the set of coefficients on the \"real\" covariates:\n$$ \nY_{i} = \\alpha + \\X_{i}'\\bfbeta + \\e_{i}\n$$\nDefined this way, we can write the OLS estimator for the \"slopes\" on $\\X_i$ as the OLS estimator with all variables demeaned\n$$ \n\\bhat = \\left(\\frac{1}{n} \\sum_{i=1}^{n} (\\X_{i} - \\overline{\\X})(\\X_{i} - \\overline{\\X})'\\right) \\left(\\frac{1}{n} \\sum_{i=1}^{n}(\\X_{i} - \\overline{\\X})(Y_{i} - \\overline{Y})\\right)\n$$\nwhich is the inverse of the sample covariance matrix of $\\X_i$ times the sample covariance of $\\X_i$ and $Y_i$. The intercept is \n$$ \n\\widehat{\\alpha} = \\overline{Y} - \\overline{\\X}'\\bhat.\n$$\n\n:::\n\nWhen dealing with actual data, we refer to the prediction errors $\\widehat{e}_{i} = Y_i - \\X_i'\\bhat$ as the **residuals** and the predicted value itself, $\\widehat{Y}_i = \\X_{i}'\\bhat$ is also called the **fitted value**. With the population linear regression, we saw that the projection errors $e_i = Y_i - \\X_i'\\bfbeta$ were mean zero and uncorrelated with the covariates $\\E[\\X_{i}e_{i}] = 0$. The residuals have a similar property with respect to the covariates in the sample:\n$$ \n\\sum_{i=1}^n \\X_i\\widehat{e}_i = 0.\n$$\nThe residuals are *exactly* uncorrelated with the covariates (when the covariates include a constant/intercept term), which is mechanically true of the OLS estimator. \n\n\n@fig-ssr-comp shows how OLS works in the bivariate case. Here we see three possible regression lines and the sum of the squared residuals for each line. OLS aims to find the line that minimizes the function on the right. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![Different possible lines and their corresponding sum of squared residuals.](07_least_squares_files/figure-html/fig-ssr-comp-1.png){#fig-ssr-comp width=672}\n:::\n:::\n\n\n## Model fit\n\nWe have learned how to use OLS to obtain an estimate of the best linear predictor, but we may ask how good that prediction is. Does using $\\X_i$ help us predict $Y_i$? To investigate this, we can consider two different prediction errors: those using covariates and those that do not. \n\nWe have already seen the prediction error when using the covariates; it is just the **sum of the squared residuals** \n$$ \nSSR = \\sum_{i=1}^n (Y_i - \\X_{i}'\\bhat)^2.\n$$\nRecall that the best predictor for $Y_i$ without any covariates is simply its sample mean, $\\overline{Y}$ and so the prediction error without covariates is what we call the **total sum of squares**,\n$$ \nTSS = \\sum_{i=1}^n (Y_i - \\overline{Y})^2.\n$$\n@fig-ssr-vs-tss shows the difference between these two types of prediction errors. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Total sum of squares vs. the sum of squared residuals.](07_least_squares_files/figure-html/fig-ssr-vs-tss-1.png){#fig-ssr-vs-tss width=672}\n:::\n:::\n\n\nWe can use the **proportion reduction in prediction error** from adding those covariates to measure how much those covariates improve the regression's predictive ability. This value, called the **coefficient of determination** or $R^2$ is simply\n$$\nR^2 = \\frac{TSS - SSR}{TSS} = 1-\\frac{SSR}{TSS},\n$$\nwhich is the reduction in error moving from $\\overline{Y}$ to $\\X_i'\\bhat$ as the predictor relative to the prediction error using $\\overline{Y}$. We can think of this as the fraction of the total prediction error eliminated by using $\\X_i$ to predict $Y_i$. 
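To see each piece of this definition in code, here is a minimal sketch in R with simulated data (all names and values are hypothetical): it computes SSR and TSS by hand and checks the resulting R-squared against the value reported by `summary()`.

```r
## Minimal sketch: computing R^2 from its definition (simulated data)
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
SSR <- sum(residuals(fit)^2)    # prediction error using the covariate
TSS <- sum((y - mean(y))^2)     # prediction error using only the sample mean

c(by_hand = 1 - SSR / TSS, from_lm = summary(fit)$r.squared)
```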
One thing to note is that OLS will *always* improve in-sample fit so that $TSS \\geq SSR$ even if $\\X_i$ is unrelated to $Y_i$. This phantom improvement occurs because the whole point of OLS is to minimize the SSR, and it will do that even if it is just chasing noise. \n\nSince regression always improves in-sample fit, $R^2$ will fall between 0 and 1. A value 0 zero would indicate exactly 0 estimated coefficients on all covariates (except the intercept) so that $Y_i$ and $\\X_i$ are perfectly orthogonal in the data (this is very unlikely to occur because there will likely be some minimal but nonzero relationship by random chance). A value of 1 indicates a perfect linear fit. \n\n## Matrix form of OLS\n\nWhile we derived the OLS estimator above, there is a much more common representation of the estimator that relies on vectors and matrices. We usually write the linear model for a generic unit, $Y_i = \\X_i'\\bfbeta + e_i$, but obviously, there are $n$ of these equations,\n$$ \n\\begin{aligned}\n Y_1 &= \\X_1'\\bfbeta + e_1 \\\\\n Y_2 &= \\X_2'\\bfbeta + e_2 \\\\\n &\\vdots \\\\\n Y_n &= \\X_n'\\bfbeta + e_n \\\\\n\\end{aligned}\n$$\nWe can write this system of equations in a more compact form using matrix algebra. In particular, let's combine the variables here into random vectors/matrices:\n$$\n\\mb{Y} = \\begin{pmatrix}\nY_1 \\\\ Y_2 \\\\ \\vdots \\\\ Y_n\n \\end{pmatrix}, \\quad\n \\mathbb{X} = \\begin{pmatrix}\n\\X'_1 \\\\\n\\X'_2 \\\\\n\\vdots \\\\\n\\X'_n\n \\end{pmatrix} =\n \\begin{pmatrix}\n 1 & X_{11} & X_{12} & \\cdots & X_{1k} \\\\\n 1 & X_{21} & X_{22} & \\cdots & X_{2k} \\\\\n \\vdots & \\vdots & \\vdots & \\vdots & \\vdots \\\\\n 1 & X_{n1} & X_{n2} & \\cdots & X_{nk} \\\\\n \\end{pmatrix},\n \\quad\n \\mb{e} = \\begin{pmatrix}\ne_1 \\\\ e_2 \\\\ \\vdots \\\\ e_n\n \\end{pmatrix}\n$$\nThen we can write the above system of equations as\n$$\n\\mb{Y} = \\mathbb{X}\\bfbeta + \\mb{e},\n$$\nwhere notice now that $\\mathbb{X}$ is a $n \\times (k+1)$ matrix and $\\bfbeta$ is a $k+1$ length column vector. \n\nA critical link between the definition of OLS above to the matrix notation comes from representing sums in matrix form. In particular, we have\n$$\n\\begin{aligned}\n \\sum_{i=1}^n \\X_i\\X_i' &= \\Xmat'\\Xmat \\\\\n \\sum_{i=1}^n \\X_iY_i &= \\Xmat'\\mb{Y},\n\\end{aligned}\n$$\nwhich means we can write the OLS estimator in the more recognizable form as \n$$ \n\\bhat = \\left( \\mathbb{X}'\\mathbb{X} \\right)^{-1} \\mathbb{X}'\\mb{Y}.\n$$\n\nOf course, we can also define the vector of residuals,\n$$ \n \\widehat{\\mb{e}} = \\mb{Y} - \\mathbb{X}\\bhat = \\left[\n\\begin{array}{c}\n Y_1 \\\\\n Y_2 \\\\\n \\vdots \\\\\n Y_n\n \\end{array}\n\\right] - \n\\left[\n\\begin{array}{c}\n 1\\widehat{\\beta}_0 + X_{11}\\widehat{\\beta}_1 + X_{12}\\widehat{\\beta}_2 + \\dots + X_{1k}\\widehat{\\beta}_k \\\\\n 1\\widehat{\\beta}_0 + X_{21}\\widehat{\\beta}_1 + X_{22}\\widehat{\\beta}_2 + \\dots + X_{2k}\\widehat{\\beta}_k \\\\\n \\vdots \\\\\n 1\\widehat{\\beta}_0 + X_{n1}\\widehat{\\beta}_1 + X_{n2}\\widehat{\\beta}_2 + \\dots + X_{nk}\\widehat{\\beta}_k\n\\end{array}\n\\right],\n$$\nand so the sum of the squared residuals, in this case, becomes\n$$ \nSSR(\\bfbeta) = \\Vert\\mb{Y} - \\mathbb{X}\\bfbeta\\Vert^{2} = (\\mb{Y} - \\mathbb{X}\\bfbeta)'(\\mb{Y} - \\mathbb{X}\\bfbeta),\n$$\nwhere the double vertical lines mean the Euclidean norm of the argument, $\\Vert \\mb{z} \\Vert = \\sqrt{\\sum_{i=1}^n z_i^{2}}$. 
The OLS minimization problem, then, is \n$$ \n\\bhat = \\argmin_{\\mb{b} \\in \\mathbb{R}^{(k+1)}}\\; \\Vert\\mb{Y} - \\mathbb{X}\\mb{b}\\Vert^{2}\n$$\nFinally, we can write the orthogonality of the covariates and the residuals as\n$$ \n\\mathbb{X}'\\widehat{\\mb{e}} = \\sum_{i=1}^{n} \\X_{i}\\widehat{e}_{i} = 0.\n$$\n\n## Rank, linear independence, and multicollinearity {#sec-rank}\n\nWhen introducing the OLS estimator, we noted that it would exist when $\\sum_{i=1}^n \\X_i\\X_i'$ is positive definite or that there is \"no multicollinearity.\" This assumption is equivalent to saying the matrix $\\mathbb{X}$ is full column rank, meaning that $\\text{rank}(\\mathbb{X}) = (k+1)$, where $k+1$ is the number of columns of $\\mathbb{X}$. Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and **linear independence** means that if $\\mathbb{X}\\mb{b} = 0$ if and only if $\\mb{b}$ is a column vector of 0s. In other words, we have\n$$ \nb_{1}\\mathbb{X}_{1} + b_{2}\\mathbb{X}_{2} + \\cdots + b_{k+1}\\mathbb{X}_{k+1} = 0 \\quad\\iff\\quad b_{1} = b_{2} = \\cdots = b_{k+1} = 0, \n$$\nwhere $\\mathbb{X}_j$ is the $j$th column of $\\mathbb{X}$. Thus, full column rank says that all the columns are linearly independent or that there is no \"multicollinearity.\"\n\nHow could this be violated? Suppose we accidentally included a linear function of one variable so that $\\mathbb{X}_2 = 2\\mathbb{X}_1$. Then we have,\n$$ \n\\begin{aligned}\n \\mathbb{X}\\mb{b} &= b_{1}\\mathbb{X}_{1} + b_{2}2\\mathbb{X}_1+ b_{3}\\mathbb{X}_{3}+ \\cdots + b_{k+1}\\mathbb{X}_{k+1} \\\\\n &= (b_{1} + 2b_{2})\\mathbb{X}_{1} + b_{3}\\mathbb{X}_{3} + \\cdots + b_{k+1}\\mathbb{X}_{k+1}\n\\end{aligned}\n$$\nIn this case, this expression equals 0 when $b_3 = b_4 = \\cdots = b_{k+1} = 0$ and $b_1 = -2b_2$. Thus, the collection of columns is linearly dependent, so we know that the rank of $\\mathbb{X}$ must be less than full column rank (that is, less than $k+1$). Hopefully, it is also clear that if we removed the problematic column $\\mathbb{X}_2$, the resulting matrix would have $k$ linearly independent columns, implying that $\\mathbb{X}$ is rank $k$. \n\nWhy does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that $\\Xmat$ if of full column rank if and only if $\\Xmat'\\Xmat$ is non-singular and a matrix is invertible if and only if it is non-singular. Thus, the columns of $\\Xmat$ being linearly independent means that the inverse $(\\Xmat'\\Xmat)^{-1}$ exists and so does $\\bhat$. Further, this full rank condition also implies that $\\Xmat'\\Xmat = \\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals.\n\nWhat are common situations that lead to violations of no multicollinearity? We have seen one above, with one variable being a linear function of another. But this problem can come out in more subtle ways. Suppose that we have a set of dummy variables corresponding to a single categorical variable, like the region of the country. In the US, this might mean we have $X_{i1} = 1$ for units in the West (0 otherwise), $X_{i2} = 1$ for units in the Midwest (0 otherwise), $X_{i3} = 1$ for units in the South (0 otherwise), and $X_{i4} = 1$ for units in the Northeast (0 otherwise). 
Each unit has to be in one of these four regions, so there is a linear dependence between these variables, \n$$ \nX_{i4} = 1 - X_{i1} - X_{i2} - X_{i3}.\n$$\nThat is, if I know that you are not in the West, Midwest, or South regions, I know that you are in the Northeast. We would get a linear dependence if we tried to include all of these variables in our regression with an intercept. (Note the 1 in the relationship between $X_{i4}$ and the other variables, that's why there will be linear dependence when including a constant.) Thus, we usually omit one dummy variable from each categorical variable. In that case, the coefficients on the remaining dummies are differences in means between that category and the omitted one (perhaps conditional on other variables included, if included). So if we omitted $X_{i4}$, then the coefficient on $X_{i1}$ would be the difference in mean outcomes between units in the West and Northeast regions. \n\nAnother way collinearity can occur is if you include both an intercept term and a variable that does not vary. This issue can often happen if we mistakenly subset our data to, say, the West region but still include the West dummy variable in the regression. \n\nFinally, note that most statistical software packages will \"solve\" the multicollinearity by arbitrarily removing as many linearly dependent covariates as is necessary to achieve full rank. R will show the estimated coefficients as `NA` in those cases. \n\n## OLS coefficients for binary and categorical regressors\n\nSuppose that the covariates include just the intercept and a single binary variable, $\\X_i = (1\\; X_{i})'$, where $X_i \\in \\{0,1\\}$. In this case, the OLS coefficient on $X_i$, $\\widehat{\\beta_{1}}$, is exactly equal to the difference in sample means of $Y_i$ in the $X_i = 1$ group and the $X_i = 0$ group:\n$$ \n\\widehat{\\beta}_{1} = \\frac{\\sum_{i=1}^{n} X_{i}Y_{i}}{\\sum_{i=1}^{n} X_{i}} - \\frac{\\sum_{i=1}^{n} (1 - X_{i})Y_{i}}{\\sum_{i=1}^{n} 1- X_{i}} = \\overline{Y}_{X =1} - \\overline{Y}_{X=0}\n$$\nThis result is not an approximation. It holds exactly for any sample size. \n\nWe can generalize this idea to discrete variables more broadly. Suppose we have our region variables from the last section and include in our covariates a constant and the dummies for the West, Midwest, and South regions. Then coefficient on the West dummy will be\n$$ \n\\widehat{\\beta}_{\\text{west}} = \\overline{Y}_{\\text{west}} - \\overline{Y}_{\\text{northeast}},\n$$\nwhich is exactly the difference in sample means of $Y_i$ between the West region and units in the \"omitted region,\" the Northeast. \n\nNote that these interpretations only hold when the regression consists solely of the binary variable or the set of categorical dummy variables. These exact relationships fail when other covariates are added to the model. \n\n\n\n## Projection and geometry of least squares\n\nOLS has a very nice geometric interpretation that can add a lot of intuition for various aspects of the method. In this geometric approach, we view $\\mb{Y}$ as an $n$-dimensional vector in $\\mathbb{R}^n$. As we saw above, OLS in matrix form is about finding a linear combination of the covariate matrix $\\Xmat$ closest to this vector in terms of the Euclidean distance (which is just the sum of squares). \n\nLet $\\mathcal{C}(\\Xmat) = \\{\\Xmat\\mb{b} : \\mb{b} \\in \\mathbb{R}^2\\}$ be the **column space** of the matrix $\\Xmat$. 
This set is all linear combinations of the columns of $\\Xmat$ or the set of all possible linear predictions we could obtain from $\\Xmat$. Notice that the OLS fitted values, $\\Xmat\\bhat$, is in this column space. If, as we assume, $\\Xmat$ has full column rank of $k+1$, then the column space $\\mathcal{C}(\\Xmat)$ will be a $k+1$-dimensional surface inside of the larger $n$-dimensional space. If $\\Xmat$ has two columns, the column space will be a plane. \n\nAnother interpretation of the OLS estimator is that it finds the linear predictor as the closest point in the column space of $\\Xmat$ to the outcome vector $\\mb{Y}$. This is called the **projection** of $\\mb{Y}$ onto $\\mathcal{C}(\\Xmat)$. @fig-projection shows this projection for a case with $n=3$ and 2 columns in $\\Xmat$. The shaded blue region represents the plane of the column space of $\\Xmat$, and we can see that $\\Xmat\\bhat$ is the closest point to $\\mb{Y}$ in that space. That's the whole idea of the OLS estimator: find the linear combination of the columns of $\\Xmat$ (a point in the column space) that minimizes the Euclidean distance between that point and the outcome vector (the sum of squared residuals).\n\n![Projection of Y on the column space of the covariates.](assets/img/projection-drawing.png){#fig-projection}\n\nThis figure shows that the residual vector, which is the difference between the $\\mb{Y}$ vector and the projection $\\Xmat\\bhat$ is perpendicular or orthogonal to the column space of $\\Xmat$. This orthogonality is a consequence of the residuals being orthogonal to all the columns of $\\Xmat$,\n$$ \n\\Xmat'\\mb{e} = 0,\n$$\nas we established above. Being orthogonal to all the columns means it will also be orthogonal to all linear combinations of the columns. \n\n## Projection and annihilator matrices\n\nNow that we have the idea of projection to the column space of $\\Xmat$, we can define a way to project any vector into that space. The $n\\times n$ **projection matrix**\n$$\n\\mb{P}_{\\Xmat} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat',\n$$\nprojects a vector into $\\mathcal{C}(\\Xmat)$. In particular, we can see that this gives us the fitted values for $\\mb{Y}$:\n$$ \n\\mb{P}_{\\Xmat}\\mb{Y} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'\\mb{Y} = \\Xmat\\bhat.\n$$\nBecause we sometimes write the linear predictor as $\\widehat{\\mb{Y}} = \\Xmat\\bhat$, the projection matrix is also called the **hat matrix**. With either name, multiplying a vector by $\\mb{P}_{\\Xmat}$ gives the best linear predictor of that vector as a function of $\\Xmat$. Intuitively, any vector that is already a linear combination of the columns of $\\Xmat$ (so is in $\\mathcal{C}(\\Xmat)$) should be unaffected by this projection: the closest point in $\\mathcal{C}(\\Xmat)$ to a point already in $\\mathcal{C}(\\Xmat)$ is itself. We can also see this algebraically for any linear combination $\\Xmat\\mb{c}$\n$$\n\\mb{P}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'\\Xmat\\mb{c} = \\Xmat\\mb{c}\n$$\nbecause $(\\Xmat'\\Xmat)^{-1} \\Xmat'\\Xmat$ simplifies to the identity matrix. In particular, the projection of $\\Xmat$ onto itself is just itself: $\\mb{P}_{\\Xmat}\\Xmat = \\Xmat$. 
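As a small numerical illustration, the sketch below (simulated data; all names are hypothetical) constructs the projection matrix directly and checks that it reproduces the OLS fitted values, leaves the columns of the design matrix unchanged, and gives the same answer when applied twice.

```r
## Minimal sketch: the projection (hat) matrix on simulated data
set.seed(123)
n <- 50
X <- cbind(1, rnorm(n), runif(n))          # n x (k+1) design matrix
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)

P <- X %*% solve(crossprod(X)) %*% t(X)    # P = X (X'X)^{-1} X'

## P y reproduces the OLS fitted values (X already has a constant, so drop lm's)
all.equal(drop(P %*% y), unname(fitted(lm(y ~ X - 1))))

## Anything already in the column space is left unchanged: P X = X
all.equal(P %*% X, X, check.attributes = FALSE)

## Applying the projection twice is the same as applying it once
all.equal(P %*% P, P)
```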
\n\nThe second matrix related to projection is the **annihilator matrix**, \n$$ \n\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\mb{P}_{\\Xmat},\n$$\nwhich projects any vector into the orthogonal complement to the column space of $\\Xmat$, \n$$\n\\mathcal{C}^{\\perp}(\\Xmat) = \\{\\mb{c} \\in \\mathbb{R}^n\\;:\\; \\Xmat\\mb{c} = 0 \\},\n$$\nThis matrix is called the annihilator matrix because if you apply it to any linear combination of $\\Xmat$, you get 0:\n$$ \n\\mb{M}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat\\mb{c} - \\mb{P}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat\\mb{c} - \\Xmat\\mb{c} = 0,\n$$\nand in particular, $\\mb{M}_{\\Xmat}\\Xmat = 0$. Why should we care about this matrix? Perhaps a more evocative name might be the **residual maker** since it makes residuals when applied to $\\mb{Y}$,\n$$ \n\\mb{M}_{\\Xmat}\\mb{Y} = (\\mb{I}_{n} - \\mb{P}_{\\Xmat})\\mb{Y} = \\mb{Y} - \\mb{P}_{\\Xmat}\\mb{Y} = \\mb{Y} - \\Xmat\\bhat = \\widehat{\\mb{e}}.\n$$\n\n\n\nThere are several fundamental property properties of the projection matrix that are useful: \n\n- $\\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}$ are **idempotent**, which means that when applied to itself, it simply returns itself: $\\mb{P}_{\\Xmat}\\mb{P}_{\\Xmat} = \\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}\\mb{M}_{\\Xmat} = \\mb{M}_{\\Xmat}$. \n\n- $\\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}$ are symmetric $n \\times n$ matrices so that $\\mb{P}_{\\Xmat}' = \\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}' = \\mb{M}_{\\Xmat}$.\n\n- The rank of $\\mb{P}_{\\Xmat}$ is $k+1$ (the number of columns of $\\Xmat$) and the rank of $\\mb{M}_{\\Xmat}$ is $n - k - 1$. \n\nWe can use the projection and annihilator matrices to arrive at an orthogonal decomposition of the outcome vector:\n$$ \n\\mb{Y} = \\Xmat\\bhat + \\widehat{\\mb{e}} = \\mb{P}_{\\Xmat}\\mb{Y} + \\mb{M}_{\\Xmat}\\mb{Y}.\n$$\n \n\n\n::: {.content-hidden}\n\n## Trace of a matrix\n\nRecall that the trace of a $k \\times k$ square matrix, $\\mb{A} = {a_{ij}}$, is sum the sum of the diagonal entries,\n$$\n\\text{trace}(\\mb{A}) = \\sum_{i=1}^{k} a_{ii},\n$$\nso, for example, $\\text{trace}(\\mb{I}_{n}) = n$. A couple of key properties of the trace:\n\n- Trace is linear: $\\text{trace}(k\\mb{A}) = k\\; \\text{trace}(\\mb{a})$ and $\\text{trace}(\\mb{A} + \\mb{B}) = \\text{trace}(\\mb{A}) + \\text{trace}(\\mb{B})$\n- Trace is invariant to multiplication direction: $\\text{trace}(\\mb{AB}) = \\text{trace}(\\mb{BA})$. \n:::\n\n\n## Residual regression\n\nThere are many situations where we can partition the covariates into two groups, and we might wonder if it is possible how to express or calculate the OLS coefficients for just one set of covariates. In particular, let the columns of $\\Xmat$ be partitioned into $[\\Xmat_{1} \\Xmat_{2}]$, so that linear prediction we are estimating is \n$$ \n\\mb{Y} = \\Xmat_{1}\\bfbeta_{1} + \\Xmat_{2}\\bfbeta_{2} + \\mb{e}, \n$$\nwith estimated coefficients and residuals\n$$ \n\\mb{Y} = \\Xmat_{1}\\bhat_{1} + \\Xmat_{2}\\bhat_{2} + \\widehat{\\mb{e}}.\n$$\n\nWe now document another way to obtain the estimator $\\bhat_1$ from this regression using a technique called **residual regression**, **partitioned regression**, or the **Frisch-Waugh-Lovell theorem**.\n \n::: {.callout-note}\n\n## Residual regression approach \n\nThe residual regression approach is:\n\n1. Use OLS to regress $\\mb{Y}$ on $\\Xmat_2$ and obtain residuals $\\widetilde{\\mb{e}}_2$. \n2. Use OLS to regress each column of $\\Xmat_1$ on $\\Xmat_2$ and obtain residuals $\\widetilde{\\Xmat}_1$.\n3. 
Use OLS to regression $\\widetilde{\\mb{e}}_{2}$ on $\\widetilde{\\Xmat}_1$. \n\n:::\n\n::: {#thm-fwl}\n\n## Frisch-Waugh-Lovell\n\nThe OLS coefficients from a regression of $\\widetilde{\\mb{e}}_{2}$ on $\\widetilde{\\Xmat}_1$ are equivalent to the coefficients on $\\Xmat_{1}$ from the regression of $\\mb{Y}$ on both $\\Xmat_{1}$ and $\\Xmat_2$. \n\n:::\n\nOne implication of this theorem is the regression coefficient for a given variable captures the relationship between the residual variation in the outcome and that variable after accounting for the other covariates. In particular, this coefficient focuses on the variation orthogonal to those other covariates. \n\nWhile perhaps unexpected, this result may not appear particularly useful. We can just run the long regression, right? This trick can be handy when $\\Xmat_2$ consists of dummy variables (or \"fixed effects\") for a categorical variable with many categories. For example, suppose $\\Xmat_2$ consists of indicators for the county of residence for a respondent. In that case, that will have over 3,000 columns, meaning that direct calculation of the $\\bhat = (\\bhat_{1}, \\bhat_{2})$ will require inverting a matrix that is bigger than $3,000 \\times 3,000$. Computationally, this process will be very slow. But above, we saw that predictions of an outcome on a categorical variable are just the sample mean within each level of the variable. Thus, in this case, the residuals $\\widetilde{\\mb{e}}_2$ and $\\Xmat_1$ can be computed by demeaning the outcome and $\\Xmat_1$ within levels of the dummies in $\\Xmat_2$, which can be considerably faster computationally. \n\nFinally, there are data visualization reasons to use residual regression. It is often difficult to see if the linear functional form for some covariate is appropriate once you begin to control for other variables. One can check the relationship using this approach with a scatterplot of $\\widetilde{\\mb{e}}_2$ on $\\Xmat_1$ (when it is a single column). \n\n\n## Outliers, leverage points, and influential observations\n\nGiven that OLS finds the coefficients that minimize the sum of the squared residuals, it is helpful to ask how much impact each residual has on that solution. Let $\\bhat_{(-i)}$ be the OLS estimates if we omit unit $i$. Intuitively, **influential observations** should significantly impact the estimated coefficients so that $\\bhat_{(-i)} - \\bhat$ is large in absolute value. \n\nUnder what conditions will we have influential observations? OLS tries to minimize the sum of **squared** residuals, so it will move more to shrink larger residuals than smaller ones. Where are large residuals likely to occur? Well, notice that any OLS regression line with a constant will go through the means of the outcome and the covariates: $\\overline{Y} = \\overline{\\X}\\bhat$. Thus, by definition, This means that when an observation is close to the average of the covariates, $\\overline{\\X}$, it cannot have that much influence because OLS forces the regression line to go through $\\overline{Y}$. Thus, we should look for influential points that have two properties:\n\n1. Have high **leverage**, where leverage roughly measures how far $\\X_i$ is from $\\overline{\\X}$, and\n2. Be an **outlier** in the sense of having a large residual (if left out of the regression).\n\nWe'll take each of these in turn. 
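Before taking each in turn, it may help to see that base R reports all of these diagnostics from a fitted model. The sketch below uses simulated data with one influential observation planted by hand (everything here is hypothetical) and previews the quantities defined in the next few subsections.

```r
## Minimal sketch: leverage, LOO prediction errors, and influence in base R
set.seed(2023)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 6                      # far from the mean of x: high leverage
y[n] <- -10                    # and far from the regression line: an outlier

fit <- lm(y ~ x)

h     <- hatvalues(fit)              # leverage h_ii
e_hat <- residuals(fit)              # in-sample residuals
e_loo <- e_hat / (1 - h)             # leave-one-out prediction errors
dfb   <- dfbeta(fit)                 # change in each coefficient from dropping i
cooks <- cooks.distance(fit)         # Cook's distance

## The planted observation stands out on every measure
round(cbind(h, e_loo, dfb, cooks)[n, ], 3)
```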
\n\n### Leverage points {#sec-leverage}\n\nWe can define the **leverage** of an observation by\n$$ \nh_{ii} = \\X_{i}'\\left(\\Xmat'\\Xmat\\right)^{-1}\\X_{i},\n$$\nwhich is the $i$th diagonal entry of the projection matrix, $\\mb{P}_{\\Xmat}$. Notice that \n$$ \n\\widehat{\\mb{Y}} = \\mb{P}\\mb{Y} \\qquad \\implies \\qquad \\widehat{Y}_i = \\sum_{j=1}^n h_{ij}Y_j,\n$$\nso that $h_{ij}$ is the importance of observation $j$ for the fitted value for observation $i$. The leverage, then, is the importance of the observation for its own fitted value. We can also interpret these values in terms of the distribution of $\\X_{i}$. Roughly speaking, these values are the weighted distance $\\X_i$ is from $\\overline{\\X}$, where the weights normalize to the empirical variance/covariance structure of the covariates (so that the scale of each covariate is roughly the same). We can see this most clearly when we fit a simple linear regression (with one covariate and an intercept) with OLS when the leverage is\n$$ \nh_{ii} = \\frac{1}{n} + \\frac{(X_i - \\overline{X})^2}{\\sum_{j=1}^n (X_j - \\overline{X})^2}\n$$\n\nLeverage values have three key properties:\n\n1. $0 \\leq h_{ii} \\leq 1$\n2. $h_{ii} \\geq 1/n$ if the model contains an intercept\n2. $\\sum_{i=1}^{n} h_{ii} = k + 1$\n\n### Outliers and leave-one-out regression\n\nIn the context of OLS, an **outlier** is an observation with a large prediction error for a particular OLS specification. @fig-outlier shows an example of an outlier. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of an outlier.](07_least_squares_files/figure-html/fig-outlier-1.png){#fig-outlier width=672}\n:::\n:::\n\n\nIntuitively, it seems as though we could use the residual $\\widehat{e}_i$ to assess the prediction error for a given unit. But the residuals are not valid predictions because the OLS estimator is designed to make those as small as possible (in machine learning parlance, these were in the training set). In particular, if an outlier is influential, we already noted that it might \"pull\" the regression line toward it, and the resulting residual might be pretty small. \n\nTo assess prediction errors more cleanly, we can use **leave-one-out regression** (LOO), which regresses$\\mb{Y}_{(-i)}$ on $\\Xmat_{(-i)}$, where these omit unit $i$:\n$$ \n\\bhat_{(-i)} = \\left(\\Xmat'_{(-i)}\\Xmat_{(-i)}\\right)^{-1}\\Xmat_{(-i)}\\mb{Y}_{(-i)}.\n$$\nWe can then calculate LOO prediction errors as\n$$ \n\\widetilde{e}_{i} = Y_{i} - \\X_{i}'\\bhat_{(-i)}.\n$$\nCalculating these LOO prediction errors for each unit appears to be computationally costly because it seems as though we have to fit OLS $n$ times. Fortunately, there is a closed-form expression for the LOO coefficients and prediction errors in terms of the original regression, \n$$ \n\\bhat_{(-i)} = \\bhat - \\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i \\qquad \\widetilde{e}_i = \\frac{\\widehat{e}_i}{1 - h_{ii}}.\n$$ {#eq-loo-coefs}\nWe can see from this that the LOO prediction errors will differ from the residuals when the leverage of a unit is high. This makes sense! We said earlier that observations with low leverage would be close to $\\overline{\\X}$, where the outcome values have relatively little impact on the OLS fit (because the regression line must go through $\\overline{Y}$). \n\n### Influence points\n\nAn influence point is an observation that has the power to change the coefficients and fitted values for a particular OLS specification. 
@fig-influence shows an example of such an influence point. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of an influence point.](07_least_squares_files/figure-html/fig-influence-1.png){#fig-influence width=672}\n:::\n:::\n\n\nOne measure of influence is called DFBETA$_i$ measures how much $i$ changes the estimated coefficient vector\n$$ \n\\bhat - \\bhat_{(-i)} = \\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i,\n$$\nso there is one value for each observation-covariate pair. When divided by the standard error of the estimated coefficients, this is called DFBETA**S** (where the \"S\" is for standardized). These are helpful if we focus on a particular coefficient. \n\n\nWhen we want to summarize how much an observation matters for the fit, we can use a compact measure of the influence of an observation by comparing the fitted value from the entire sample to the fitted value from the leave-one-out regression. Using the DFBETA above, we have\n$$ \n\\widehat{Y}_i - \\X_{i}\\bhat_{(-1)} = \\X_{i}'(\\bhat -\\bhat_{(-1)}) = \\X_{i}'\\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i = h_{ii}\\widetilde{e}_i,\n$$\nso the influence of an observation is its leverage times how much of an outlier it is. This value is sometimes called DFFIT (difference in fit). One transformation of this quantity, **Cook's distance**, standardizes this by the sum of the squared residuals:\n$$ \nD_i = \\frac{n-k-1}{k+1}\\frac{h_{ii}\\widetilde{e}_{i}^{2}}{\\widehat{\\mb{e}}'\\widehat{\\mb{e}}}.\n$$\nVarious rules exist for establishing cutoffs for identifying an observation as \"influential\" based on these metrics, but they tend to be ad hoc. In any case, it's better to focus on the holistic question of \"how much does this observation matter for my substantive interpretation\" rather than the narrow question of a particular threshold. \n\n\nIt's all well and good to find influential points, but what should you do about it? The first thing to check is that the data is not corrupted somehow. Sometimes influence points occur because of a coding or data entry error. If you have control over that coding, you should fix those errors. You may consider removing the observation if the error appears in the data acquired from another source. Still, when writing up your analyses, you should be extremely clear about this choice. Another approach is to consider a transformation of the dependent or independent variables, like the natural logarithm, that might dampen the effects of outliers. Finally, consider using methods that are robust to outliers. \n", + "markdown": "# The mechanics of least squares {#sec-ols-mechanics}\n\nThis chapter explores the most widely used estimator for population linear regressions: **ordinary least squares** (OLS). OLS is a plug-in estimator for the best linear projection (or population linear regression) described in the last chapter. Its popularity is partly due to its ease of interpretation, computational simplicity, and statistical efficiency. \n\nIn this chapter, we focus on motivating the estimator and the mechanical or algebraic properties of the OLS estimator. In the next chapter, we will investigate its statistical assumptions. Textbooks often introduce OLS under an assumption of a linear model for the conditional expectation, but this is unnecessary if we view the inference target as the best linear predictor. We discuss this point more fully in the next chapter. 
\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Relationship between political institutions and economic development from Acemoglu, Johnson, and Robinson (2001).](07_least_squares_files/figure-html/fig-ajr-scatter-1.png){#fig-ajr-scatter width=672}\n:::\n:::\n\n\n\n\n## Deriving the OLS estimator \n\nIn the last chapter on the linear model and the best linear projection, we operated purely in the population, not samples. We derived the population regression coefficients $\\bfbeta$, representing the coefficients on the line of best fit in the population. We now take these as our quantity of interest. \n\n::: {.callout-note}\n## Assumption\n\n\nThe variables $\\{(Y_1, \\X_1), \\ldots, (Y_i,\\X_i), \\ldots, (Y_n, \\X_n)\\}$ are i.i.d. draws from a common distribution $F$.\n\n:::\n\nRecall the population linear coefficients (or best linear predictor coefficients) that we derived in the last chapter,\n$$ \n\\bfbeta = \\argmin_{\\mb{b} \\in \\real^k}\\; \\E\\bigl[ \\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\\bigr] = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}]\n$$\n\nWe will consider two different ways to derive the OLS estimator for these coefficients, both of which are versions of the plug-in principle. The first approach is to use the closed-form representation of the coefficients and replace any expectations with sample means,\n$$ \n\\bhat = \\left(\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\frac{1}{n} \\sum_{i=1}^n \\X_{i}Y_{i} \\right),\n$$\nwhich exists if $\\sum_{i=1}^n \\X_i\\X_i'$ is **positive definite** and thus invertible. We will return to this assumption below. \n\n\nIn a simple bivariate linear projection model $m(X_{i}) = \\beta_0 + \\beta_1X_{i}$, we saw that the population slope was $\\beta_1= \\text{cov}(Y_{i},X_{i})/ \\V[X_{i}]$ and this approach would have our estimator for the slope be the ratio of the sample covariance of $Y_i$ and $X_i$ to the sample variance of $X_i$, or\n$$ \n\\widehat{\\beta}_{1} = \\frac{\\widehat{\\sigma}_{Y,X}}{\\widehat{\\sigma}^{2}_{X}} = \\frac{ \\frac{1}{n-1}\\sum_{i=1}^{n} (Y_{i} - \\overline{Y})(X_{i} - \\overline{X})}{\\frac{1}{n-1} \\sum_{i=1}^{n} (X_{i} - \\Xbar)^{2}}.\n$$\n\nThis plug-in approach is widely applicable and tends to have excellent properties in large samples under iid data. But this approach also hides some of the geometry of the setting. \n\nThe second approach applies the plug-in principle not to the closed-form expression for the coefficients but to the optimization problem itself. We call this the **least squares** estimator because it minimizes the empirical (or sample) squared prediction error,\n$$ \n\\bhat = \\argmin_{\\mb{b} \\in \\real^k}\\; \\frac{1}{n} \\sum_{i=1}^{n}\\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2 = \\argmin_{\\mb{b} \\in \\real^k}\\; SSR(\\mb{b}),\n$$\nwhere,\n$$ \nSSR(\\mb{b}) = \\sum_{i=1}^{n}\\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\n$$\nis the sum of the squared residuals. To distinguish it from other, more complicated least squares estimators, we call this the **ordinary least squares** estimator or OLS. \n\nLet's solve this minimization problem! 
We can write down the first-order conditions as\n$$ \n0=\\frac{\\partial SSR(\\bhat)}{\\partial \\bfbeta} = 2 \\left(\\sum_{i=1}^{n} \\X_{i}Y_{i}\\right) - 2\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right)\\bhat.\n$$\nWe can rearrange this system of equations to\n$$ \n\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right)\\bhat = \\left(\\sum_{i=1}^{n} \\X_{i}Y_{i}\\right).\n$$\nTo obtain the solution for $\\bhat$, notice that $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is a $(k+1) \\times (k+1)$ matrix and $\\bhat$ and $\\sum_{i=1}^{n} \\X_{i}Y_{i}$ are both $k+1$ length column vectors. If $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is invertible, then we can multiply both sides of this equation by that inverse to arrive at\n$$ \n\\bhat = \\left(\\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\sum_{i=1}^n \\X_{i}Y_{i} \\right),\n$$\nwhich is the same expression as the plug-in estimator (after canceling the $1/n$ terms). To confirm that we have found a minimum, we also need to check the second-order condition, \n$$ \n \\frac{\\partial^{2} SSR(\\bhat)}{\\partial \\bfbeta\\bfbeta'} = 2\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right) > 0.\n$$\nWhat does it mean for a matrix to be \"positive\"? In matrix algebra, this condition means that the matrix $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is **positive definite**, a condition that we discuss in @sec-rank. \n\n\nUsing the plug-in or least squares approaches, we arrive at the same estimator for the best linear predictor/population linear regression coefficients.\n\n::: {#thm-ols}\n\nIf the $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is positive definite, then the ordinary least squares estimator is\n$$\n\\bhat = \\left(\\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\sum_{i=1}^n \\X_{i}Y_{i} \\right).\n$$\n\n:::\n\n\n::: {.callout-note}\n\n## Formula for the OLS slopes\n\nAlmost all regression will contain an intercept term usually represented as a constant 1 in the covariate vector. It is also possible to separate the intercept to arrive at the set of coefficients on the \"real\" covariates:\n$$ \nY_{i} = \\alpha + \\X_{i}'\\bfbeta + \\e_{i}.\n$$\nDefined this way, we can write the OLS estimator for the \"slopes\" on $\\X_i$ as the OLS estimator with all variables demeaned\n$$ \n\\bhat = \\left(\\frac{1}{n} \\sum_{i=1}^{n} (\\X_{i} - \\overline{\\X})(\\X_{i} - \\overline{\\X})'\\right) \\left(\\frac{1}{n} \\sum_{i=1}^{n}(\\X_{i} - \\overline{\\X})(Y_{i} - \\overline{Y})\\right)\n$$\nwhich is the inverse of the sample covariance matrix of $\\X_i$ times the sample covariance of $\\X_i$ and $Y_i$. The intercept is \n$$ \n\\widehat{\\alpha} = \\overline{Y} - \\overline{\\X}'\\bhat.\n$$\n\n:::\n\nWhen dealing with actual data, we refer to the prediction errors $\\widehat{e}_{i} = Y_i - \\X_i'\\bhat$ as the **residuals** and the predicted value itself, $\\widehat{Y}_i = \\X_{i}'\\bhat$, is also called the **fitted value**. With the population linear regression, we saw that the projection errors $e_i = Y_i - \\X_i'\\bfbeta$ were mean zero and uncorrelated with the covariates $\\E[\\X_{i}e_{i}] = 0$. The residuals have a similar property with respect to the covariates in the sample:\n$$ \n\\sum_{i=1}^n \\X_i\\widehat{e}_i = 0.\n$$\nThe residuals are *exactly* uncorrelated with the covariates (when the covariates include a constant/intercept term), which is mechanically true of the OLS estimator. \n\n\n@fig-ssr-comp shows how OLS works in the bivariate case. Here we see three possible regression lines and the sum of the squared residuals for each line. 
OLS aims to find the line that minimizes the function on the right. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![Different possible lines and their corresponding sum of squared residuals.](07_least_squares_files/figure-html/fig-ssr-comp-1.png){#fig-ssr-comp width=672}\n:::\n:::\n\n\n## Model fit\n\nWe have learned how to use OLS to obtain an estimate of the best linear predictor, but we may ask how good that prediction is. Does using $\\X_i$ help us predict $Y_i$? To investigate this, we can consider two different prediction errors: those using covariates and those that do not. \n\nWe have already seen the prediction error when using the covariates; it is just the **sum of the squared residuals**,\n$$ \nSSR = \\sum_{i=1}^n (Y_i - \\X_{i}'\\bhat)^2.\n$$\nRecall that the best predictor for $Y_i$ without any covariates is simply its sample mean $\\overline{Y}$, and so the prediction error without covariates is what we call the **total sum of squares**,\n$$ \nTSS = \\sum_{i=1}^n (Y_i - \\overline{Y})^2.\n$$\n@fig-ssr-vs-tss shows the difference between these two types of prediction errors. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Total sum of squares vs. the sum of squared residuals.](07_least_squares_files/figure-html/fig-ssr-vs-tss-1.png){#fig-ssr-vs-tss width=672}\n:::\n:::\n\n\nWe can use the **proportion reduction in prediction error** from adding those covariates to measure how much those covariates improve the regression's predictive ability. This value, called the **coefficient of determination** or $R^2$, is simply\n$$\nR^2 = \\frac{TSS - SSR}{TSS} = 1-\\frac{SSR}{TSS},\n$$\nwhich is the reduction in error moving from $\\overline{Y}$ to $\\X_i'\\bhat$ as the predictor relative to the prediction error using $\\overline{Y}$. We can think of this as the fraction of the total prediction error eliminated by using $\\X_i$ to predict $Y_i$. One thing to note is that OLS will *always* improve in-sample fit so that $TSS \\geq SSR$ even if $\\X_i$ is unrelated to $Y_i$. This phantom improvement occurs because the whole point of OLS is to minimize the SSR, and it will do that even if it is just chasing noise. \n\nSince regression always improves in-sample fit, $R^2$ will fall between 0 and 1. A value 0 zero would indicate exactly 0 estimated coefficients on all covariates (except the intercept) so that $Y_i$ and $\\X_i$ are perfectly orthogonal in the data (this is very unlikely to occur because there will likely be some minimal but nonzero relationship by random chance). A value of 1 indicates a perfect linear fit. \n\n## Matrix form of OLS\n\nWhile we derived the OLS estimator above, there is a much more common representation of the estimator that relies on vectors and matrices. We usually write the linear model for a generic unit, $Y_i = \\X_i'\\bfbeta + e_i$, but obviously, there are $n$ of these equations,\n$$ \n\\begin{aligned}\n Y_1 &= \\X_1'\\bfbeta + e_1 \\\\\n Y_2 &= \\X_2'\\bfbeta + e_2 \\\\\n &\\vdots \\\\\n Y_n &= \\X_n'\\bfbeta + e_n \\\\\n\\end{aligned}\n$$\nWe can write this system of equations in a more compact form using matrix algebra. 
In particular, let's combine the variables here into random vectors/matrices:\n$$\n\\mb{Y} = \\begin{pmatrix}\nY_1 \\\\ Y_2 \\\\ \\vdots \\\\ Y_n\n \\end{pmatrix}, \\quad\n \\mathbb{X} = \\begin{pmatrix}\n\\X'_1 \\\\\n\\X'_2 \\\\\n\\vdots \\\\\n\\X'_n\n \\end{pmatrix} =\n \\begin{pmatrix}\n 1 & X_{11} & X_{12} & \\cdots & X_{1k} \\\\\n 1 & X_{21} & X_{22} & \\cdots & X_{2k} \\\\\n \\vdots & \\vdots & \\vdots & \\vdots & \\vdots \\\\\n 1 & X_{n1} & X_{n2} & \\cdots & X_{nk} \\\\\n \\end{pmatrix},\n \\quad\n \\mb{e} = \\begin{pmatrix}\ne_1 \\\\ e_2 \\\\ \\vdots \\\\ e_n\n \\end{pmatrix}\n$$\nThen we can write the above system of equations as\n$$\n\\mb{Y} = \\mathbb{X}\\bfbeta + \\mb{e},\n$$\nwhere notice now that $\\mathbb{X}$ is an $n \\times (k+1)$ matrix and $\\bfbeta$ is a $k+1$ length column vector. \n\nA critical link between the definition of OLS above to the matrix notation comes from representing sums in matrix form. In particular, we have\n$$\n\\begin{aligned}\n \\sum_{i=1}^n \\X_i\\X_i' &= \\Xmat'\\Xmat \\\\\n \\sum_{i=1}^n \\X_iY_i &= \\Xmat'\\mb{Y},\n\\end{aligned}\n$$\nwhich means we can write the OLS estimator in the more recognizable form as \n$$ \n\\bhat = \\left( \\mathbb{X}'\\mathbb{X} \\right)^{-1} \\mathbb{X}'\\mb{Y}.\n$$\n\nOf course, we can also define the vector of residuals,\n$$ \n \\widehat{\\mb{e}} = \\mb{Y} - \\mathbb{X}\\bhat = \\left[\n\\begin{array}{c}\n Y_1 \\\\\n Y_2 \\\\\n \\vdots \\\\\n Y_n\n \\end{array}\n\\right] - \n\\left[\n\\begin{array}{c}\n 1\\widehat{\\beta}_0 + X_{11}\\widehat{\\beta}_1 + X_{12}\\widehat{\\beta}_2 + \\dots + X_{1k}\\widehat{\\beta}_k \\\\\n 1\\widehat{\\beta}_0 + X_{21}\\widehat{\\beta}_1 + X_{22}\\widehat{\\beta}_2 + \\dots + X_{2k}\\widehat{\\beta}_k \\\\\n \\vdots \\\\\n 1\\widehat{\\beta}_0 + X_{n1}\\widehat{\\beta}_1 + X_{n2}\\widehat{\\beta}_2 + \\dots + X_{nk}\\widehat{\\beta}_k\n\\end{array}\n\\right],\n$$\nand so the sum of the squared residuals, in this case, becomes\n$$ \nSSR(\\bfbeta) = \\Vert\\mb{Y} - \\mathbb{X}\\bfbeta\\Vert^{2} = (\\mb{Y} - \\mathbb{X}\\bfbeta)'(\\mb{Y} - \\mathbb{X}\\bfbeta),\n$$\nwhere the double vertical lines mean the Euclidean norm of the argument, $\\Vert \\mb{z} \\Vert = \\sqrt{\\sum_{i=1}^n z_i^{2}}$. The OLS minimization problem, then, is \n$$ \n\\bhat = \\argmin_{\\mb{b} \\in \\mathbb{R}^{(k+1)}}\\; \\Vert\\mb{Y} - \\mathbb{X}\\mb{b}\\Vert^{2}\n$$\nFinally, we can write the orthogonality of the covariates and the residuals as\n$$ \n\\mathbb{X}'\\widehat{\\mb{e}} = \\sum_{i=1}^{n} \\X_{i}\\widehat{e}_{i} = 0.\n$$\n\n## Rank, linear independence, and multicollinearity {#sec-rank}\n\nWhen introducing the OLS estimator, we noted that it would exist when $\\sum_{i=1}^n \\X_i\\X_i'$ is positive definite or that there is \"no multicollinearity.\" This assumption is equivalent to saying that the matrix $\\mathbb{X}$ is full column rank, meaning that $\\text{rank}(\\mathbb{X}) = (k+1)$, where $k+1$ is the number of columns of $\\mathbb{X}$. Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and **linear independence** means that $\\mathbb{X}\\mb{b} = 0$ if and only if $\\mb{b}$ is a column vector of 0s. In other words, we have\n$$ \nb_{1}\\mathbb{X}_{1} + b_{2}\\mathbb{X}_{2} + \\cdots + b_{k+1}\\mathbb{X}_{k+1} = 0 \\quad\\iff\\quad b_{1} = b_{2} = \\cdots = b_{k+1} = 0, \n$$\nwhere $\\mathbb{X}_j$ is the $j$th column of $\\mathbb{X}$. 
Thus, full column rank says that all the columns are linearly independent or that there is no \"multicollinearity.\"\n\nHow could this be violated? Suppose we accidentally included a linear function of one variable so that $\\mathbb{X}_2 = 2\\mathbb{X}_1$. Then we have,\n$$ \n\\begin{aligned}\n \\mathbb{X}\\mb{b} &= b_{1}\\mathbb{X}_{1} + b_{2}2\\mathbb{X}_1+ b_{3}\\mathbb{X}_{3}+ \\cdots + b_{k+1}\\mathbb{X}_{k+1} \\\\\n &= (b_{1} + 2b_{2})\\mathbb{X}_{1} + b_{3}\\mathbb{X}_{3} + \\cdots + b_{k+1}\\mathbb{X}_{k+1}\n\\end{aligned}\n$$\nIn this case, this expression equals 0 when $b_3 = b_4 = \\cdots = b_{k+1} = 0$ and $b_1 = -2b_2$. Thus, the collection of columns is linearly dependent, so we know that the rank of $\\mathbb{X}$ must be less than full column rank (that is, less than $k+1$). Hopefully, it is also clear that if we removed the problematic column $\\mathbb{X}_2$, the resulting matrix would have $k$ linearly independent columns, implying that $\\mathbb{X}$ is rank $k$. \n\nWhy does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that $\\Xmat$ is of full column rank if and only if $\\Xmat'\\Xmat$ is non-singular and a matrix is invertible if and only if it is non-singular. Thus, the columns of $\\Xmat$ being linearly independent means that the inverse $(\\Xmat'\\Xmat)^{-1}$ exists and so does $\\bhat$. Further, this full rank condition also implies that $\\Xmat'\\Xmat = \\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals.\n\nWhat are common situations that lead to violations of no multicollinearity? We have seen one above, with one variable being a linear function of another. But this problem can come out in more subtle ways. Suppose that we have a set of dummy variables corresponding to a single categorical variable, like the region of the country. In the US, this might mean we have $X_{i1} = 1$ for units in the West (0 otherwise), $X_{i2} = 1$ for units in the Midwest (0 otherwise), $X_{i3} = 1$ for units in the South (0 otherwise), and $X_{i4} = 1$ for units in the Northeast (0 otherwise). Each unit has to be in one of these four regions, so there is a linear dependence between these variables, \n$$ \nX_{i4} = 1 - X_{i1} - X_{i2} - X_{i3}.\n$$\nThat is, if I know that you are not in the West, Midwest, or South regions, I know that you are in the Northeast. We would get a linear dependence if we tried to include all of these variables in our regression with an intercept. (Note the 1 in the relationship between $X_{i4}$ and the other variables, that's why there will be linear dependence when including a constant.) Thus, we usually omit one dummy variable from each categorical variable. In that case, the coefficients on the remaining dummies are differences in means between that category and the omitted one (perhaps conditional on other variables included, if included). So if we omitted $X_{i4}$, then the coefficient on $X_{i1}$ would be the difference in mean outcomes between units in the West and Northeast regions. \n\nAnother way collinearity can occur is if you include both an intercept term and a variable that does not vary. This issue can often happen if we mistakenly subset our data to, say, the West region but still include the West dummy variable in the regression. 
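As a concrete illustration of the dummy-variable trap, here is a minimal sketch in R with simulated region data (the data and names are hypothetical). Including all four dummies alongside the intercept creates an exact linear dependence, and dropping one dummy by hand recovers the difference-in-means interpretation described above.

```r
## Minimal sketch: the dummy-variable trap with region indicators (simulated data)
set.seed(1)
n <- 400
region <- sample(c("West", "Midwest", "South", "Northeast"), n, replace = TRUE)
y <- 2 + 0.5 * (region == "West") + rnorm(n)

west      <- as.numeric(region == "West")
midwest   <- as.numeric(region == "Midwest")
south     <- as.numeric(region == "South")
northeast <- as.numeric(region == "Northeast")

## All four dummies plus the intercept are collinear: one coefficient comes back NA
coef(lm(y ~ west + midwest + south + northeast))

## Omitting the Northeast dummy ourselves: the West coefficient is a difference in means
coef(lm(y ~ west + midwest + south))["west"]
mean(y[region == "West"]) - mean(y[region == "Northeast"])
```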
\n\nFinally, note that most statistical software packages will \"solve\" the multicollinearity by arbitrarily removing as many linearly dependent covariates as is necessary to achieve full rank. R will show the estimated coefficients as `NA` in those cases. \n\n## OLS coefficients for binary and categorical regressors\n\nSuppose that the covariates include just the intercept and a single binary variable, $\\X_i = (1\\; X_{i})'$, where $X_i \\in \\{0,1\\}$. In this case, the OLS coefficient on $X_i$, $\\widehat{\\beta_{1}}$, is exactly equal to the difference in sample means of $Y_i$ in the $X_i = 1$ group and the $X_i = 0$ group:\n$$ \n\\widehat{\\beta}_{1} = \\frac{\\sum_{i=1}^{n} X_{i}Y_{i}}{\\sum_{i=1}^{n} X_{i}} - \\frac{\\sum_{i=1}^{n} (1 - X_{i})Y_{i}}{\\sum_{i=1}^{n} 1- X_{i}} = \\overline{Y}_{X =1} - \\overline{Y}_{X=0}\n$$\nThis result is not an approximation. It holds exactly for any sample size. \n\nWe can generalize this idea to discrete variables more broadly. Suppose we have our region variables from the last section and include in our covariates a constant and the dummies for the West, Midwest, and South regions. Then the coefficient on the West dummy will be\n$$ \n\\widehat{\\beta}_{\\text{west}} = \\overline{Y}_{\\text{west}} - \\overline{Y}_{\\text{northeast}},\n$$\nwhich is exactly the difference in sample means of $Y_i$ between the West region and units in the \"omitted region,\" the Northeast. \n\nNote that these interpretations only hold when the regression consists solely of the binary variable or the set of categorical dummy variables. These exact relationships fail when other covariates are added to the model. \n\n\n\n## Projection and geometry of least squares\n\nOLS has a very nice geometric interpretation that can add a lot of intuition for various aspects of the method. In this geometric approach, we view $\\mb{Y}$ as an $n$-dimensional vector in $\\mathbb{R}^n$. As we saw above, OLS in matrix form is about finding a linear combination of the covariate matrix $\\Xmat$ closest to this vector in terms of the Euclidean distance (which is just the sum of squares). \n\nLet $\\mathcal{C}(\\Xmat) = \\{\\Xmat\\mb{b} : \\mb{b} \\in \\mathbb{R}^(k+1)\\}$ be the **column space** of the matrix $\\Xmat$. This set is all linear combinations of the columns of $\\Xmat$ or the set of all possible linear predictions we could obtain from $\\Xmat$. Notice that the OLS fitted values, $\\Xmat\\bhat$, are in this column space. If, as we assume, $\\Xmat$ has full column rank of $k+1$, then the column space $\\mathcal{C}(\\Xmat)$ will be a $k+1$-dimensional surface inside of the larger $n$-dimensional space. If $\\Xmat$ has two columns, the column space will be a plane. \n\nAnother interpretation of the OLS estimator is that it finds the linear predictor as the closest point in the column space of $\\Xmat$ to the outcome vector $\\mb{Y}$. This is called the **projection** of $\\mb{Y}$ onto $\\mathcal{C}(\\Xmat)$. @fig-projection shows this projection for a case with $n=3$ and 2 columns in $\\Xmat$. The shaded blue region represents the plane of the column space of $\\Xmat$, and we can see that $\\Xmat\\bhat$ is the closest point to $\\mb{Y}$ in that space. 
\n\n\n\n## Projection and geometry of least squares\n\nOLS has a very nice geometric interpretation that can add a lot of intuition for various aspects of the method. In this geometric approach, we view $\\mb{Y}$ as an $n$-dimensional vector in $\\mathbb{R}^n$. As we saw above, OLS in matrix form is about finding a linear combination of the covariate matrix $\\Xmat$ closest to this vector in terms of the Euclidean distance (which is just the sum of squares). \n\nLet $\\mathcal{C}(\\Xmat) = \\{\\Xmat\\mb{b} : \\mb{b} \\in \\mathbb{R}^{k+1}\\}$ be the **column space** of the matrix $\\Xmat$. This set is all linear combinations of the columns of $\\Xmat$ or the set of all possible linear predictions we could obtain from $\\Xmat$. Notice that the OLS fitted values, $\\Xmat\\bhat$, are in this column space. If, as we assume, $\\Xmat$ has full column rank of $k+1$, then the column space $\\mathcal{C}(\\Xmat)$ will be a $k+1$-dimensional surface inside of the larger $n$-dimensional space. If $\\Xmat$ has two columns, the column space will be a plane. \n\nAnother interpretation of the OLS estimator is that it finds the linear predictor as the closest point in the column space of $\\Xmat$ to the outcome vector $\\mb{Y}$. This is called the **projection** of $\\mb{Y}$ onto $\\mathcal{C}(\\Xmat)$. @fig-projection shows this projection for a case with $n=3$ and 2 columns in $\\Xmat$. The shaded blue region represents the plane of the column space of $\\Xmat$, and we can see that $\\Xmat\\bhat$ is the closest point to $\\mb{Y}$ in that space. That's the whole idea of the OLS estimator: find the linear combination of the columns of $\\Xmat$ (a point in the column space) that minimizes the Euclidean distance between that point and the outcome vector (the sum of squared residuals).\n\n![Projection of Y on the column space of the covariates.](assets/img/projection-drawing.png){#fig-projection}\n\nThis figure shows that the residual vector, which is the difference between the $\\mb{Y}$ vector and the projection $\\Xmat\\bhat$, is perpendicular or orthogonal to the column space of $\\Xmat$. This orthogonality is a consequence of the residuals being orthogonal to all the columns of $\\Xmat$,\n$$ \n\\Xmat'\\widehat{\\mb{e}} = 0,\n$$\nas we established above. Being orthogonal to all the columns means it will also be orthogonal to all linear combinations of the columns. \n\n## Projection and annihilator matrices\n\nNow that we have the idea of projection to the column space of $\\Xmat$, we can define a way to project any vector into that space. The $n\\times n$ **projection matrix**,\n$$\n\\mb{P}_{\\Xmat} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat',\n$$\nprojects a vector into $\\mathcal{C}(\\Xmat)$. In particular, we can see that this gives us the fitted values for $\\mb{Y}$:\n$$ \n\\mb{P}_{\\Xmat}\\mb{Y} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'\\mb{Y} = \\Xmat\\bhat.\n$$\nBecause we sometimes write the linear predictor as $\\widehat{\\mb{Y}} = \\Xmat\\bhat$, the projection matrix is also called the **hat matrix**. With either name, multiplying a vector by $\\mb{P}_{\\Xmat}$ gives the best linear predictor of that vector as a function of $\\Xmat$. Intuitively, any vector that is already a linear combination of the columns of $\\Xmat$ (so is in $\\mathcal{C}(\\Xmat)$) should be unaffected by this projection: the closest point in $\\mathcal{C}(\\Xmat)$ to a point already in $\\mathcal{C}(\\Xmat)$ is itself. We can also see this algebraically for any linear combination $\\Xmat\\mb{c}$,\n$$\n\\mb{P}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'\\Xmat\\mb{c} = \\Xmat\\mb{c},\n$$\nbecause $(\\Xmat'\\Xmat)^{-1} \\Xmat'\\Xmat$ simplifies to the identity matrix. In particular, the projection of $\\Xmat$ onto itself is just itself: $\\mb{P}_{\\Xmat}\\Xmat = \\Xmat$. \n\nThe second matrix related to projection is the **annihilator matrix**, \n$$ \n\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\mb{P}_{\\Xmat},\n$$\nwhich projects any vector into the orthogonal complement of the column space of $\\Xmat$, \n$$\n\\mathcal{C}^{\\perp}(\\Xmat) = \\{\\mb{c} \\in \\mathbb{R}^n\\;:\\; \\Xmat'\\mb{c} = 0 \\}.\n$$\nThis matrix is called the annihilator matrix because if you apply it to any linear combination of $\\Xmat$, you get 0:\n$$ \n\\mb{M}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat\\mb{c} - \\mb{P}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat\\mb{c} - \\Xmat\\mb{c} = 0,\n$$\nand in particular, $\\mb{M}_{\\Xmat}\\Xmat = 0$. Why should we care about this matrix? Perhaps a more evocative name might be the **residual maker** since it makes residuals when applied to $\\mb{Y}$,\n$$ \n\\mb{M}_{\\Xmat}\\mb{Y} = (\\mb{I}_{n} - \\mb{P}_{\\Xmat})\\mb{Y} = \\mb{Y} - \\mb{P}_{\\Xmat}\\mb{Y} = \\mb{Y} - \\Xmat\\bhat = \\widehat{\\mb{e}}.\n$$\n\n\n\nThere are several fundamental properties of the projection and annihilator matrices that are useful: \n\n- $\\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}$ are **idempotent**, which means that applying either matrix to itself simply returns that matrix: $\\mb{P}_{\\Xmat}\\mb{P}_{\\Xmat} = \\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}\\mb{M}_{\\Xmat} = \\mb{M}_{\\Xmat}$. \n\n- $\\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}$ are symmetric $n \\times n$ matrices so that $\\mb{P}_{\\Xmat}' = \\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}' = \\mb{M}_{\\Xmat}$.\n\n- The rank of $\\mb{P}_{\\Xmat}$ is $k+1$ (the number of columns of $\\Xmat$) and the rank of $\\mb{M}_{\\Xmat}$ is $n - k - 1$. \n\nWe can use the projection and annihilator matrices to arrive at an orthogonal decomposition of the outcome vector:\n$$ \n\\mb{Y} = \\Xmat\\bhat + \\widehat{\\mb{e}} = \\mb{P}_{\\Xmat}\\mb{Y} + \\mb{M}_{\\Xmat}\\mb{Y}.\n$$\n 
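These properties are easy to check numerically. A minimal simulated sketch (hypothetical data; the projection and annihilator matrices are built directly from the design matrix):

```r
# Simulated sketch: the projection (hat) matrix and the residual maker
set.seed(123)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))        # design matrix including the constant
Y <- drop(X %*% c(1, 2, -1)) + rnorm(n)

P <- X %*% solve(crossprod(X)) %*% t(X)  # projection onto the column space of X
M <- diag(n) - P                         # annihilator / residual maker

fit <- lm(Y ~ X - 1)                     # X already contains the intercept column
all.equal(drop(P %*% Y), unname(fitted(fit)))  # P Y reproduces the fitted values
all.equal(drop(M %*% Y), unname(resid(fit)))   # M Y reproduces the residuals
all.equal(P %*% P, P)                    # idempotent
max(abs(M %*% X))                        # essentially zero: M annihilates X
```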
\n\n::: {.content-hidden}\n\n## Trace of a matrix\n\nRecall that the trace of a $k \\times k$ square matrix, $\\mb{A} = \\{a_{ij}\\}$, is the sum of the diagonal entries,\n$$\n\\text{trace}(\\mb{A}) = \\sum_{i=1}^{k} a_{ii},\n$$\nso, for example, $\\text{trace}(\\mb{I}_{n}) = n$. A couple of key properties of the trace:\n\n- Trace is linear: $\\text{trace}(k\\mb{A}) = k\\; \\text{trace}(\\mb{A})$ and $\\text{trace}(\\mb{A} + \\mb{B}) = \\text{trace}(\\mb{A}) + \\text{trace}(\\mb{B})$\n- Trace is invariant to multiplication direction: $\\text{trace}(\\mb{AB}) = \\text{trace}(\\mb{BA})$. \n:::\n\n\n## Residual regression\n\nThere are many situations where we can partition the covariates into two groups, and we might wonder how to express or calculate the OLS coefficients for just one set of covariates. In particular, let the columns of $\\Xmat$ be partitioned into $[\\Xmat_{1} \\Xmat_{2}]$, so that the linear prediction we are estimating is \n$$ \n\\mb{Y} = \\Xmat_{1}\\bfbeta_{1} + \\Xmat_{2}\\bfbeta_{2} + \\mb{e}, \n$$\nwith estimated coefficients and residuals\n$$ \n\\mb{Y} = \\Xmat_{1}\\bhat_{1} + \\Xmat_{2}\\bhat_{2} + \\widehat{\\mb{e}}.\n$$\n\nWe now document another way to obtain the estimator $\\bhat_1$ from this regression using a technique called **residual regression**, **partitioned regression**, or the **Frisch-Waugh-Lovell theorem**.\n \n::: {.callout-note}\n\n## Residual regression approach \n\nThe residual regression approach is:\n\n1. Use OLS to regress $\\mb{Y}$ on $\\Xmat_2$ and obtain residuals $\\widetilde{\\mb{e}}_2$. \n2. Use OLS to regress each column of $\\Xmat_1$ on $\\Xmat_2$ and obtain residuals $\\widetilde{\\Xmat}_1$.\n3. Use OLS to regress $\\widetilde{\\mb{e}}_{2}$ on $\\widetilde{\\Xmat}_1$. \n\n:::\n\n::: {#thm-fwl}\n\n## Frisch-Waugh-Lovell\n\nThe OLS coefficients from a regression of $\\widetilde{\\mb{e}}_{2}$ on $\\widetilde{\\Xmat}_1$ are equivalent to the coefficients on $\\Xmat_{1}$ from the regression of $\\mb{Y}$ on both $\\Xmat_{1}$ and $\\Xmat_2$. \n\n:::\n\nOne implication of this theorem is that the regression coefficient for a given variable captures the relationship between the residual variation in the outcome and that variable after accounting for the other covariates. In particular, this coefficient focuses on the variation orthogonal to those other covariates. \n\nWhile perhaps unexpected, this result may not appear particularly useful. We can just run the long regression, right? This trick can be handy when $\\Xmat_2$ consists of dummy variables (or \"fixed effects\") for a categorical variable with many categories. For example, suppose $\\Xmat_2$ consists of indicators for the county of residence for a respondent. In that case, that will have over 3,000 columns, meaning that direct calculation of the $\\bhat = (\\bhat_{1}, \\bhat_{2})$ will require inverting a matrix that is bigger than $3,000 \\times 3,000$. Computationally, this process will be very slow. But above, we saw that predictions of an outcome on a categorical variable are just the sample mean within each level of the variable. Thus, in this case, the residuals $\\widetilde{\\mb{e}}_2$ and $\\widetilde{\\Xmat}_1$ can be computed by demeaning the outcome and $\\Xmat_1$ within levels of the dummies in $\\Xmat_2$, which can be considerably faster computationally. \n\nFinally, there are data visualization reasons to use residual regression. It is often difficult to see if the linear functional form for some covariate is appropriate once you begin to control for other variables. One can check the relationship using this approach with a scatterplot of $\\widetilde{\\mb{e}}_2$ on $\\Xmat_1$ (when it is a single column). 
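A quick numerical check of @thm-fwl on simulated data (hypothetical variables, with the intercept folded into the second group of covariates):

```r
# Frisch-Waugh-Lovell: long regression vs. residual regression
set.seed(456)
n <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)      # correlated with x1
y <- 1 + 2 * x1 - x2 + rnorm(n)

coef(lm(y ~ x1 + x2))['x1']    # coefficient on x1 in the long regression

e_y <- resid(lm(y ~ x2))       # residualize the outcome on x2 (and the constant)
e_x1 <- resid(lm(x1 ~ x2))     # residualize x1 on x2 (and the constant)
coef(lm(e_y ~ e_x1))['e_x1']   # identical to the long-regression coefficient
```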
\n\n\n## Outliers, leverage points, and influential observations\n\nGiven that OLS finds the coefficients that minimize the sum of the squared residuals, it is helpful to ask how much impact each residual has on that solution. Let $\\bhat_{(-i)}$ be the OLS estimates if we omit unit $i$. Intuitively, **influential observations** should significantly impact the estimated coefficients so that $\\bhat_{(-i)} - \\bhat$ is large in absolute value. \n\nUnder what conditions will we have influential observations? OLS tries to minimize the sum of **squared** residuals, so it will move more to shrink larger residuals than smaller ones. Where are large residuals likely to occur? Well, notice that any OLS regression line with a constant will go through the means of the outcome and the covariates: $\\overline{Y} = \\overline{\\X}'\\bhat$. Thus, when an observation is close to the average of the covariates, $\\overline{\\X}$, it cannot have that much influence because OLS forces the regression line to go through $\\overline{Y}$. Thus, we should look for influential points that have two properties:\n\n1. Have high **leverage**, where leverage roughly measures how far $\\X_i$ is from $\\overline{\\X}$, and\n2. Be an **outlier** in the sense of having a large residual (if left out of the regression).\n\nWe'll take each of these in turn. \n\n### Leverage points {#sec-leverage}\n\nWe can define the **leverage** of an observation by\n$$ \nh_{ii} = \\X_{i}'\\left(\\Xmat'\\Xmat\\right)^{-1}\\X_{i},\n$$\nwhich is the $i$th diagonal entry of the projection matrix, $\\mb{P}_{\\Xmat}$. Notice that \n$$ \n\\widehat{\\mb{Y}} = \\mb{P}_{\\Xmat}\\mb{Y} \\qquad \\implies \\qquad \\widehat{Y}_i = \\sum_{j=1}^n h_{ij}Y_j,\n$$\nso that $h_{ij}$ is the importance of observation $j$ for the fitted value for observation $i$. The leverage, then, is the importance of the observation for its own fitted value. We can also interpret these values in terms of the distribution of $\\X_{i}$. Roughly speaking, these values are the weighted distance $\\X_i$ is from $\\overline{\\X}$, where the weights normalize to the empirical variance/covariance structure of the covariates (so that the scale of each covariate is roughly the same). We can see this most clearly when we fit a simple linear regression (with one covariate and an intercept) with OLS, where the leverage is\n$$ \nh_{ii} = \\frac{1}{n} + \\frac{(X_i - \\overline{X})^2}{\\sum_{j=1}^n (X_j - \\overline{X})^2}.\n$$\n\nLeverage values have three key properties:\n\n1. $0 \\leq h_{ii} \\leq 1$\n2. $h_{ii} \\geq 1/n$ if the model contains an intercept\n3. $\\sum_{i=1}^{n} h_{ii} = k + 1$
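These properties are straightforward to verify with `hatvalues()`. A minimal simulated sketch (hypothetical data):

```r
# Leverage values from a simple regression
set.seed(789)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

h <- hatvalues(fit)   # the diagonal entries h_ii of the projection matrix
sum(h)                # equals k + 1 = 2 here
range(h)              # all values between 1/n and 1
all.equal(unname(h), 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2))  # simple-regression formula
```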
\n\n### Outliers and leave-one-out regression\n\nIn the context of OLS, an **outlier** is an observation with a large prediction error for a particular OLS specification. @fig-outlier shows an example of an outlier. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of an outlier.](07_least_squares_files/figure-html/fig-outlier-1.png){#fig-outlier width=672}\n:::\n:::\n\n\nIntuitively, it seems as though we could use the residual $\\widehat{e}_i$ to assess the prediction error for a given unit. But the residuals are not valid predictions because the OLS estimator is designed to make those as small as possible (in machine learning parlance, these were in the training set). In particular, if an outlier is influential, we already noted that it might \"pull\" the regression line toward it, and the resulting residual might be pretty small. \n\nTo assess prediction errors more cleanly, we can use **leave-one-out regression** (LOO), which regresses $\\mb{Y}_{(-i)}$ on $\\Xmat_{(-i)}$, where these omit unit $i$:\n$$ \n\\bhat_{(-i)} = \\left(\\Xmat'_{(-i)}\\Xmat_{(-i)}\\right)^{-1}\\Xmat'_{(-i)}\\mb{Y}_{(-i)}.\n$$\nWe can then calculate LOO prediction errors as\n$$ \n\\widetilde{e}_{i} = Y_{i} - \\X_{i}'\\bhat_{(-i)}.\n$$\nCalculating these LOO prediction errors for each unit appears to be computationally costly because it seems as though we have to fit OLS $n$ times. Fortunately, there is a closed-form expression for the LOO coefficients and prediction errors in terms of the original regression, \n$$ \n\\bhat_{(-i)} = \\bhat - \\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i \\qquad \\widetilde{e}_i = \\frac{\\widehat{e}_i}{1 - h_{ii}}.\n$$ {#eq-loo-coefs}\nWe can see from this that the LOO prediction errors will differ from the residuals when the leverage of a unit is high. This makes sense! We said earlier that observations with low leverage would be close to $\\overline{\\X}$, where the outcome values have relatively little impact on the OLS fit (because the regression line must go through $\\overline{Y}$). \n\n### Influence points\n\nAn influence point is an observation that has the power to change the coefficients and fitted values for a particular OLS specification. @fig-influence shows an example of such an influence point. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of an influence point.](07_least_squares_files/figure-html/fig-influence-1.png){#fig-influence width=672}\n:::\n:::\n\n\nOne measure of influence, called DFBETA$_i$, measures how much $i$ changes the estimated coefficient vector\n$$ \n\\bhat - \\bhat_{(-i)} = \\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i,\n$$\nso there is one value for each observation-covariate pair. When divided by the standard error of the estimated coefficients, this is called DFBETA**S** (where the \"S\" is for standardized). These are helpful if we focus on a particular coefficient. \n\n\nWhen we want to summarize how much an observation matters for the fit, we can use a compact measure of the influence of an observation by comparing the fitted value from the entire sample to the fitted value from the leave-one-out regression. Using the DFBETA above, we have\n$$ \n\\widehat{Y}_i - \\X_{i}'\\bhat_{(-i)} = \\X_{i}'(\\bhat -\\bhat_{(-i)}) = \\X_{i}'\\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i = h_{ii}\\widetilde{e}_i,\n$$\nso the influence of an observation is its leverage times how much of an outlier it is. This value is sometimes called DFFIT (difference in fit). 
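All of these quantities can be computed from a single fit. A minimal simulated sketch (hypothetical data; the last line reproduces the DFFIT quantity just described):

```r
# LOO prediction errors and influence measures without refitting n times
set.seed(321)
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + 2 * dat$x + rnorm(100)
fit <- lm(y ~ x, data = dat)

h <- hatvalues(fit)
e_loo <- resid(fit) / (1 - h)  # LOO prediction errors via @eq-loo-coefs

# Spot-check unit 1 against an explicit leave-one-out refit
fit_no1 <- lm(y ~ x, data = dat[-1, ])
c(e_loo[1], dat$y[1] - predict(fit_no1, newdata = dat[1, ]))  # same value

head(dfbeta(fit))   # DFBETA: change in each coefficient from dropping observation i
head(h * e_loo)     # leverage times the LOO prediction error, i.e., DFFIT
```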
One transformation of this quantity, **Cook's distance**, standardizes this by the sum of the squared residuals:\n$$ \nD_i = \\frac{n-k-1}{k+1}\\frac{h_{ii}\\widetilde{e}_{i}^{2}}{\\widehat{\\mb{e}}'\\widehat{\\mb{e}}}.\n$$\nVarious rules exist for establishing cutoffs for identifying an observation as \"influential\" based on these metrics, but they tend to be ad hoc. In any case, it's better to focus on the holistic question of \"how much does this observation matter for my substantive interpretation\" rather than the narrow question of a particular threshold. \n\n\nIt's all well and good to find influential points, but what should you do about it? The first thing to check is that the data is not corrupted somehow. Sometimes influence points occur because of a coding or data entry error. If you have control over that coding, you should fix those errors. You may consider removing the observation if the error appears in the data acquired from another source. Still, when writing up your analyses, you should be extremely transparent about this choice. Another approach is to consider a transformation of the dependent or independent variables, like the natural logarithm, that might dampen the effects of outliers. Finally, consider using methods that are robust to outliers. \n", "supporting": [ "07_least_squares_files/figure-html" ], diff --git a/_freeze/07_least_squares/execute-results/tex.json b/_freeze/07_least_squares/execute-results/tex.json index 68eed1f..404caec 100644 --- a/_freeze/07_least_squares/execute-results/tex.json +++ b/_freeze/07_least_squares/execute-results/tex.json @@ -1,7 +1,7 @@ { - "hash": "04505ae479cd88516932111dfa804a74", + "hash": "90f4eadf59de9404c076aba8c58ea089", "result": { - "markdown": "# The mechanics of least squares {#sec-ols-mechanics}\n\nThis chapter explores the most widely used estimator for population linear regressions: **ordinary least squares** (OLS). OLS is a plug-in estimator for the best linear projection (or population linear regression) described in the last chapter. Its popularity is partly due to its ease of interpretation, computational simplicity, and statistical efficiency. \n\nIn this chapter, we focus on motivating the estimator and the mechanical or algebraic properties of the OLS estimator. In the next chapter, we will investigate its statistical assumptions. Textbooks often introduce OLS under an assumption of a linear model for the conditional expectation, but this is unnecessary if we view the inference target as the best linear predictor. We discuss this point more fully in the next chapter. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Relationship between political institutions and economic development from Acemoglu, Johnson, and Robinson (2001).](07_least_squares_files/figure-pdf/fig-ajr-scatter-1.pdf){#fig-ajr-scatter}\n:::\n:::\n\n\n\n\n\n## Deriving the OLS estimator \n\nIn the last chapter on the linear model and the best linear projection, we operated purely in the population, not samples. We derived the population regression coefficients $\\bfbeta$, representing the coefficients on the line of best fit in the population. We now take these as our quantity of interest. \n\n::: {.callout-note}\n## Assumption\n\n\nThe variables $\\{(Y_1, \\X_1), \\ldots, (Y_i,\\X_i), \\ldots, (Y_n, \\X_n)\\}$ are i.i.d. 
draws from a common distribution $F$.\n\n:::\n\nRecall the population linear coefficients (or best linear predictor coefficients) that we derived in the last chapter,\n$$ \n\\bfbeta = \\argmin_{\\mb{b} \\in \\real^k}\\; \\E\\bigl[ \\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\\bigr] = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}]\n$$\n\nWe will consider two different ways to derive the OLS estimator for these coefficients, both of which are versions of the plug-in principle. The first approach is to use the closed-form representation of the coefficients and replace any expectations with sample means,\n$$ \n\\bhat = \\left(\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\frac{1}{n} \\sum_{i=1}^n \\X_{i}Y_{i} \\right),\n$$\nwhich exists if $\\sum_{i=1}^n \\X_i\\X_i'$ is **positive definite** and thus invertible. We will return to this assumption below. \n\n\nIn a simple bivariate linear projection model $m(X_{i}) = \\beta_0 + \\beta_1X_{i}$, we saw that the population slope was $\\beta_1= \\text{cov}(Y_{i},X_{i})/ \\V[X_{i}]$ and this approach would have our estimator for the slope be the ratio of the sample covariance of $Y_i$ and $X_i$ to the sample variance of $X_i$, or\n$$ \n\\widehat{\\beta}_{1} = \\frac{\\widehat{\\sigma}_{Y,X}}{\\widehat{\\sigma}^{2}_{X}} = \\frac{ \\frac{1}{n-1}\\sum_{i=1}^{n} (Y_{i} - \\overline{Y})(X_{i} - \\overline{X})}{\\frac{1}{n-1} \\sum_{i=1}^{n} (X_{i} - \\Xbar)^{2}}.\n$$\n\nThis plug-in approach is widely applicable and tends to have excellent properties in large samples under iid data. But this approach also hides some of the geometry of the setting. \n\nThe second approach applies the plug-in principle not to the closed-form expression for the coefficients but to the optimization problem itself. We call this the **least squares** estimator because it minimizes the empirical (or sample) squared prediction error,\n$$ \n\\bhat = \\argmin_{\\mb{b} \\in \\real^k}\\; \\frac{1}{n} \\sum_{i=1}^{n}\\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2 = \\argmin_{\\mb{b} \\in \\real^k}\\; SSR(\\mb{b}),\n$$\nwhere,\n$$ \nSSR(\\mb{b}) = \\sum_{i=1}^{n}\\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\n$$\nis the sum of the squared residuals. To distinguish it from other, more complicated least squares estimators, we call this the **ordinary least squares** estimator or OLS. \n\nLet's solve this minimization problem! We can write down the first-order conditions as\n$$ \n0=\\frac{\\partial SSR(\\bhat)}{\\partial \\bfbeta} = 2 \\left(\\sum_{i=1}^{n} \\X_{i}Y_{i}\\right) - 2\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right)\\bhat.\n$$\nWe can rearrange this system of equations to\n$$ \n\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right)\\bhat = \\left(\\sum_{i=1}^{n} \\X_{i}Y_{i}\\right).\n$$\nTo obtain the solution for $\\bhat$, notice that $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is a $(k+1) \\times (k+1)$ matrix and $\\bhat$ and $\\sum_{i=1}^{n} \\X_{i}Y_{i}$ are both $k+1$ length column vectors. If $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is invertible, then we can multiply both sides of this equation by that inverse to arrive at\n$$ \n\\bhat = \\left(\\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\sum_{i=1}^n \\X_{i}Y_{i} \\right),\n$$\nwhich is the same expression as the plug-in estimator (after canceling the $1/n$ terms). 
To confirm that we have found a minimum, we also need to check the second-order condition, \n$$ \n \\frac{\\partial^{2} SSR(\\bhat)}{\\partial \\bfbeta\\bfbeta'} = 2\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right) > 0.\n$$\nWhat does it mean for a matrix to be \"positive\"? In matrix algebra, this condition means that the matrix $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is **positive definite**, a condition that we discuss in @sec-rank. \n\n\nUsing the plug-in or least squares approaches, we arrive at the same estimator for the best linear predictor/population linear regression coefficients.\n\n::: {#thm-ols}\n\nIf the $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is positive definite, then the ordinary least squares estimator is\n$$\n\\bhat = \\left(\\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\sum_{i=1}^n \\X_{i}Y_{i} \\right).\n$$\n\n:::\n\n\n::: {.callout-note}\n\n## Formula for the OLS slopes\n\nAlmost all regression will contain an intercept term usually represented as a constant 1 in the covariate vector. It is also possible to separate the intercept to arrive at the set of coefficients on the \"real\" covariates:\n$$ \nY_{i} = \\alpha + \\X_{i}'\\bfbeta + \\e_{i}\n$$\nDefined this way, we can write the OLS estimator for the \"slopes\" on $\\X_i$ as the OLS estimator with all variables demeaned\n$$ \n\\bhat = \\left(\\frac{1}{n} \\sum_{i=1}^{n} (\\X_{i} - \\overline{\\X})(\\X_{i} - \\overline{\\X})'\\right) \\left(\\frac{1}{n} \\sum_{i=1}^{n}(\\X_{i} - \\overline{\\X})(Y_{i} - \\overline{Y})\\right)\n$$\nwhich is the inverse of the sample covariance matrix of $\\X_i$ times the sample covariance of $\\X_i$ and $Y_i$. The intercept is \n$$ \n\\widehat{\\alpha} = \\overline{Y} - \\overline{\\X}'\\bhat.\n$$\n\n:::\n\nWhen dealing with actual data, we refer to the prediction errors $\\widehat{e}_{i} = Y_i - \\X_i'\\bhat$ as the **residuals** and the predicted value itself, $\\widehat{Y}_i = \\X_{i}'\\bhat$ is also called the **fitted value**. With the population linear regression, we saw that the projection errors $e_i = Y_i - \\X_i'\\bfbeta$ were mean zero and uncorrelated with the covariates $\\E[\\X_{i}e_{i}] = 0$. The residuals have a similar property with respect to the covariates in the sample:\n$$ \n\\sum_{i=1}^n \\X_i\\widehat{e}_i = 0.\n$$\nThe residuals are *exactly* uncorrelated with the covariates (when the covariates include a constant/intercept term), which is mechanically true of the OLS estimator. \n\n\n@fig-ssr-comp shows how OLS works in the bivariate case. Here we see three possible regression lines and the sum of the squared residuals for each line. OLS aims to find the line that minimizes the function on the right. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Different possible lines and their corresponding sum of squared residuals.](07_least_squares_files/figure-pdf/fig-ssr-comp-1.pdf){#fig-ssr-comp}\n:::\n:::\n\n\n\n## Model fit\n\nWe have learned how to use OLS to obtain an estimate of the best linear predictor, but we may ask how good that prediction is. Does using $\\X_i$ help us predict $Y_i$? To investigate this, we can consider two different prediction errors: those using covariates and those that do not. 
\n\nWe have already seen the prediction error when using the covariates; it is just the **sum of the squared residuals** \n$$ \nSSR = \\sum_{i=1}^n (Y_i - \\X_{i}'\\bhat)^2.\n$$\nRecall that the best predictor for $Y_i$ without any covariates is simply its sample mean, $\\overline{Y}$ and so the prediction error without covariates is what we call the **total sum of squares**,\n$$ \nTSS = \\sum_{i=1}^n (Y_i - \\overline{Y})^2.\n$$\n@fig-ssr-vs-tss shows the difference between these two types of prediction errors. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Total sum of squares vs. the sum of squared residuals.](07_least_squares_files/figure-pdf/fig-ssr-vs-tss-1.pdf){#fig-ssr-vs-tss}\n:::\n:::\n\n\n\nWe can use the **proportion reduction in prediction error** from adding those covariates to measure how much those covariates improve the regression's predictive ability. This value, called the **coefficient of determination** or $R^2$ is simply\n$$\nR^2 = \\frac{TSS - SSR}{TSS} = 1-\\frac{SSR}{TSS},\n$$\nwhich is the reduction in error moving from $\\overline{Y}$ to $\\X_i'\\bhat$ as the predictor relative to the prediction error using $\\overline{Y}$. We can think of this as the fraction of the total prediction error eliminated by using $\\X_i$ to predict $Y_i$. One thing to note is that OLS will *always* improve in-sample fit so that $TSS \\geq SSR$ even if $\\X_i$ is unrelated to $Y_i$. This phantom improvement occurs because the whole point of OLS is to minimize the SSR, and it will do that even if it is just chasing noise. \n\nSince regression always improves in-sample fit, $R^2$ will fall between 0 and 1. A value 0 zero would indicate exactly 0 estimated coefficients on all covariates (except the intercept) so that $Y_i$ and $\\X_i$ are perfectly orthogonal in the data (this is very unlikely to occur because there will likely be some minimal but nonzero relationship by random chance). A value of 1 indicates a perfect linear fit. \n\n## Matrix form of OLS\n\nWhile we derived the OLS estimator above, there is a much more common representation of the estimator that relies on vectors and matrices. We usually write the linear model for a generic unit, $Y_i = \\X_i'\\bfbeta + e_i$, but obviously, there are $n$ of these equations,\n$$ \n\\begin{aligned}\n Y_1 &= \\X_1'\\bfbeta + e_1 \\\\\n Y_2 &= \\X_2'\\bfbeta + e_2 \\\\\n &\\vdots \\\\\n Y_n &= \\X_n'\\bfbeta + e_n \\\\\n\\end{aligned}\n$$\nWe can write this system of equations in a more compact form using matrix algebra. In particular, let's combine the variables here into random vectors/matrices:\n$$\n\\mb{Y} = \\begin{pmatrix}\nY_1 \\\\ Y_2 \\\\ \\vdots \\\\ Y_n\n \\end{pmatrix}, \\quad\n \\mathbb{X} = \\begin{pmatrix}\n\\X'_1 \\\\\n\\X'_2 \\\\\n\\vdots \\\\\n\\X'_n\n \\end{pmatrix} =\n \\begin{pmatrix}\n 1 & X_{11} & X_{12} & \\cdots & X_{1k} \\\\\n 1 & X_{21} & X_{22} & \\cdots & X_{2k} \\\\\n \\vdots & \\vdots & \\vdots & \\vdots & \\vdots \\\\\n 1 & X_{n1} & X_{n2} & \\cdots & X_{nk} \\\\\n \\end{pmatrix},\n \\quad\n \\mb{e} = \\begin{pmatrix}\ne_1 \\\\ e_2 \\\\ \\vdots \\\\ e_n\n \\end{pmatrix}\n$$\nThen we can write the above system of equations as\n$$\n\\mb{Y} = \\mathbb{X}\\bfbeta + \\mb{e},\n$$\nwhere notice now that $\\mathbb{X}$ is a $n \\times (k+1)$ matrix and $\\bfbeta$ is a $k+1$ length column vector. \n\nA critical link between the definition of OLS above to the matrix notation comes from representing sums in matrix form. 
In particular, we have\n$$\n\\begin{aligned}\n \\sum_{i=1}^n \\X_i\\X_i' &= \\Xmat'\\Xmat \\\\\n \\sum_{i=1}^n \\X_iY_i &= \\Xmat'\\mb{Y},\n\\end{aligned}\n$$\nwhich means we can write the OLS estimator in the more recognizable form as \n$$ \n\\bhat = \\left( \\mathbb{X}'\\mathbb{X} \\right)^{-1} \\mathbb{X}'\\mb{Y}.\n$$\n\nOf course, we can also define the vector of residuals,\n$$ \n \\widehat{\\mb{e}} = \\mb{Y} - \\mathbb{X}\\bhat = \\left[\n\\begin{array}{c}\n Y_1 \\\\\n Y_2 \\\\\n \\vdots \\\\\n Y_n\n \\end{array}\n\\right] - \n\\left[\n\\begin{array}{c}\n 1\\widehat{\\beta}_0 + X_{11}\\widehat{\\beta}_1 + X_{12}\\widehat{\\beta}_2 + \\dots + X_{1k}\\widehat{\\beta}_k \\\\\n 1\\widehat{\\beta}_0 + X_{21}\\widehat{\\beta}_1 + X_{22}\\widehat{\\beta}_2 + \\dots + X_{2k}\\widehat{\\beta}_k \\\\\n \\vdots \\\\\n 1\\widehat{\\beta}_0 + X_{n1}\\widehat{\\beta}_1 + X_{n2}\\widehat{\\beta}_2 + \\dots + X_{nk}\\widehat{\\beta}_k\n\\end{array}\n\\right],\n$$\nand so the sum of the squared residuals, in this case, becomes\n$$ \nSSR(\\bfbeta) = \\Vert\\mb{Y} - \\mathbb{X}\\bfbeta\\Vert^{2} = (\\mb{Y} - \\mathbb{X}\\bfbeta)'(\\mb{Y} - \\mathbb{X}\\bfbeta),\n$$\nwhere the double vertical lines mean the Euclidean norm of the argument, $\\Vert \\mb{z} \\Vert = \\sqrt{\\sum_{i=1}^n z_i^{2}}$. The OLS minimization problem, then, is \n$$ \n\\bhat = \\argmin_{\\mb{b} \\in \\mathbb{R}^{(k+1)}}\\; \\Vert\\mb{Y} - \\mathbb{X}\\mb{b}\\Vert^{2}\n$$\nFinally, we can write the orthogonality of the covariates and the residuals as\n$$ \n\\mathbb{X}'\\widehat{\\mb{e}} = \\sum_{i=1}^{n} \\X_{i}\\widehat{e}_{i} = 0.\n$$\n\n## Rank, linear independence, and multicollinearity {#sec-rank}\n\nWhen introducing the OLS estimator, we noted that it would exist when $\\sum_{i=1}^n \\X_i\\X_i'$ is positive definite or that there is \"no multicollinearity.\" This assumption is equivalent to saying the matrix $\\mathbb{X}$ is full column rank, meaning that $\\text{rank}(\\mathbb{X}) = (k+1)$, where $k+1$ is the number of columns of $\\mathbb{X}$. Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and **linear independence** means that if $\\mathbb{X}\\mb{b} = 0$ if and only if $\\mb{b}$ is a column vector of 0s. In other words, we have\n$$ \nb_{1}\\mathbb{X}_{1} + b_{2}\\mathbb{X}_{2} + \\cdots + b_{k+1}\\mathbb{X}_{k+1} = 0 \\quad\\iff\\quad b_{1} = b_{2} = \\cdots = b_{k+1} = 0, \n$$\nwhere $\\mathbb{X}_j$ is the $j$th column of $\\mathbb{X}$. Thus, full column rank says that all the columns are linearly independent or that there is no \"multicollinearity.\"\n\nHow could this be violated? Suppose we accidentally included a linear function of one variable so that $\\mathbb{X}_2 = 2\\mathbb{X}_1$. Then we have,\n$$ \n\\begin{aligned}\n \\mathbb{X}\\mb{b} &= b_{1}\\mathbb{X}_{1} + b_{2}2\\mathbb{X}_1+ b_{3}\\mathbb{X}_{3}+ \\cdots + b_{k+1}\\mathbb{X}_{k+1} \\\\\n &= (b_{1} + 2b_{2})\\mathbb{X}_{1} + b_{3}\\mathbb{X}_{3} + \\cdots + b_{k+1}\\mathbb{X}_{k+1}\n\\end{aligned}\n$$\nIn this case, this expression equals 0 when $b_3 = b_4 = \\cdots = b_{k+1} = 0$ and $b_1 = -2b_2$. Thus, the collection of columns is linearly dependent, so we know that the rank of $\\mathbb{X}$ must be less than full column rank (that is, less than $k+1$). Hopefully, it is also clear that if we removed the problematic column $\\mathbb{X}_2$, the resulting matrix would have $k$ linearly independent columns, implying that $\\mathbb{X}$ is rank $k$. 
\n\nWhy does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that $\\Xmat$ if of full column rank if and only if $\\Xmat'\\Xmat$ is non-singular and a matrix is invertible if and only if it is non-singular. Thus, the columns of $\\Xmat$ being linearly independent means that the inverse $(\\Xmat'\\Xmat)^{-1}$ exists and so does $\\bhat$. Further, this full rank condition also implies that $\\Xmat'\\Xmat = \\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals.\n\nWhat are common situations that lead to violations of no multicollinearity? We have seen one above, with one variable being a linear function of another. But this problem can come out in more subtle ways. Suppose that we have a set of dummy variables corresponding to a single categorical variable, like the region of the country. In the US, this might mean we have $X_{i1} = 1$ for units in the West (0 otherwise), $X_{i2} = 1$ for units in the Midwest (0 otherwise), $X_{i3} = 1$ for units in the South (0 otherwise), and $X_{i4} = 1$ for units in the Northeast (0 otherwise). Each unit has to be in one of these four regions, so there is a linear dependence between these variables, \n$$ \nX_{i4} = 1 - X_{i1} - X_{i2} - X_{i3}.\n$$\nThat is, if I know that you are not in the West, Midwest, or South regions, I know that you are in the Northeast. We would get a linear dependence if we tried to include all of these variables in our regression with an intercept. (Note the 1 in the relationship between $X_{i4}$ and the other variables, that's why there will be linear dependence when including a constant.) Thus, we usually omit one dummy variable from each categorical variable. In that case, the coefficients on the remaining dummies are differences in means between that category and the omitted one (perhaps conditional on other variables included, if included). So if we omitted $X_{i4}$, then the coefficient on $X_{i1}$ would be the difference in mean outcomes between units in the West and Northeast regions. \n\nAnother way collinearity can occur is if you include both an intercept term and a variable that does not vary. This issue can often happen if we mistakenly subset our data to, say, the West region but still include the West dummy variable in the regression. \n\nFinally, note that most statistical software packages will \"solve\" the multicollinearity by arbitrarily removing as many linearly dependent covariates as is necessary to achieve full rank. R will show the estimated coefficients as `NA` in those cases. \n\n## OLS coefficients for binary and categorical regressors\n\nSuppose that the covariates include just the intercept and a single binary variable, $\\X_i = (1\\; X_{i})'$, where $X_i \\in \\{0,1\\}$. In this case, the OLS coefficient on $X_i$, $\\widehat{\\beta_{1}}$, is exactly equal to the difference in sample means of $Y_i$ in the $X_i = 1$ group and the $X_i = 0$ group:\n$$ \n\\widehat{\\beta}_{1} = \\frac{\\sum_{i=1}^{n} X_{i}Y_{i}}{\\sum_{i=1}^{n} X_{i}} - \\frac{\\sum_{i=1}^{n} (1 - X_{i})Y_{i}}{\\sum_{i=1}^{n} 1- X_{i}} = \\overline{Y}_{X =1} - \\overline{Y}_{X=0}\n$$\nThis result is not an approximation. It holds exactly for any sample size. \n\nWe can generalize this idea to discrete variables more broadly. Suppose we have our region variables from the last section and include in our covariates a constant and the dummies for the West, Midwest, and South regions. 
Then coefficient on the West dummy will be\n$$ \n\\widehat{\\beta}_{\\text{west}} = \\overline{Y}_{\\text{west}} - \\overline{Y}_{\\text{northeast}},\n$$\nwhich is exactly the difference in sample means of $Y_i$ between the West region and units in the \"omitted region,\" the Northeast. \n\nNote that these interpretations only hold when the regression consists solely of the binary variable or the set of categorical dummy variables. These exact relationships fail when other covariates are added to the model. \n\n\n\n## Projection and geometry of least squares\n\nOLS has a very nice geometric interpretation that can add a lot of intuition for various aspects of the method. In this geometric approach, we view $\\mb{Y}$ as an $n$-dimensional vector in $\\mathbb{R}^n$. As we saw above, OLS in matrix form is about finding a linear combination of the covariate matrix $\\Xmat$ closest to this vector in terms of the Euclidean distance (which is just the sum of squares). \n\nLet $\\mathcal{C}(\\Xmat) = \\{\\Xmat\\mb{b} : \\mb{b} \\in \\mathbb{R}^2\\}$ be the **column space** of the matrix $\\Xmat$. This set is all linear combinations of the columns of $\\Xmat$ or the set of all possible linear predictions we could obtain from $\\Xmat$. Notice that the OLS fitted values, $\\Xmat\\bhat$, is in this column space. If, as we assume, $\\Xmat$ has full column rank of $k+1$, then the column space $\\mathcal{C}(\\Xmat)$ will be a $k+1$-dimensional surface inside of the larger $n$-dimensional space. If $\\Xmat$ has two columns, the column space will be a plane. \n\nAnother interpretation of the OLS estimator is that it finds the linear predictor as the closest point in the column space of $\\Xmat$ to the outcome vector $\\mb{Y}$. This is called the **projection** of $\\mb{Y}$ onto $\\mathcal{C}(\\Xmat)$. @fig-projection shows this projection for a case with $n=3$ and 2 columns in $\\Xmat$. The shaded blue region represents the plane of the column space of $\\Xmat$, and we can see that $\\Xmat\\bhat$ is the closest point to $\\mb{Y}$ in that space. That's the whole idea of the OLS estimator: find the linear combination of the columns of $\\Xmat$ (a point in the column space) that minimizes the Euclidean distance between that point and the outcome vector (the sum of squared residuals).\n\n![Projection of Y on the column space of the covariates.](assets/img/projection-drawing.png){#fig-projection}\n\nThis figure shows that the residual vector, which is the difference between the $\\mb{Y}$ vector and the projection $\\Xmat\\bhat$ is perpendicular or orthogonal to the column space of $\\Xmat$. This orthogonality is a consequence of the residuals being orthogonal to all the columns of $\\Xmat$,\n$$ \n\\Xmat'\\mb{e} = 0,\n$$\nas we established above. Being orthogonal to all the columns means it will also be orthogonal to all linear combinations of the columns. \n\n## Projection and annihilator matrices\n\nNow that we have the idea of projection to the column space of $\\Xmat$, we can define a way to project any vector into that space. The $n\\times n$ **projection matrix**\n$$\n\\mb{P}_{\\Xmat} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat',\n$$\nprojects a vector into $\\mathcal{C}(\\Xmat)$. In particular, we can see that this gives us the fitted values for $\\mb{Y}$:\n$$ \n\\mb{P}_{\\Xmat}\\mb{Y} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'\\mb{Y} = \\Xmat\\bhat.\n$$\nBecause we sometimes write the linear predictor as $\\widehat{\\mb{Y}} = \\Xmat\\bhat$, the projection matrix is also called the **hat matrix**. 
With either name, multiplying a vector by $\\mb{P}_{\\Xmat}$ gives the best linear predictor of that vector as a function of $\\Xmat$. Intuitively, any vector that is already a linear combination of the columns of $\\Xmat$ (so is in $\\mathcal{C}(\\Xmat)$) should be unaffected by this projection: the closest point in $\\mathcal{C}(\\Xmat)$ to a point already in $\\mathcal{C}(\\Xmat)$ is itself. We can also see this algebraically for any linear combination $\\Xmat\\mb{c}$\n$$\n\\mb{P}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'\\Xmat\\mb{c} = \\Xmat\\mb{c}\n$$\nbecause $(\\Xmat'\\Xmat)^{-1} \\Xmat'\\Xmat$ simplifies to the identity matrix. In particular, the projection of $\\Xmat$ onto itself is just itself: $\\mb{P}_{\\Xmat}\\Xmat = \\Xmat$. \n\nThe second matrix related to projection is the **annihilator matrix**, \n$$ \n\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\mb{P}_{\\Xmat},\n$$\nwhich projects any vector into the orthogonal complement to the column space of $\\Xmat$, \n$$\n\\mathcal{C}^{\\perp}(\\Xmat) = \\{\\mb{c} \\in \\mathbb{R}^n\\;:\\; \\Xmat\\mb{c} = 0 \\},\n$$\nThis matrix is called the annihilator matrix because if you apply it to any linear combination of $\\Xmat$, you get 0:\n$$ \n\\mb{M}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat\\mb{c} - \\mb{P}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat\\mb{c} - \\Xmat\\mb{c} = 0,\n$$\nand in particular, $\\mb{M}_{\\Xmat}\\Xmat = 0$. Why should we care about this matrix? Perhaps a more evocative name might be the **residual maker** since it makes residuals when applied to $\\mb{Y}$,\n$$ \n\\mb{M}_{\\Xmat}\\mb{Y} = (\\mb{I}_{n} - \\mb{P}_{\\Xmat})\\mb{Y} = \\mb{Y} - \\mb{P}_{\\Xmat}\\mb{Y} = \\mb{Y} - \\Xmat\\bhat = \\widehat{\\mb{e}}.\n$$\n\n\n\nThere are several fundamental property properties of the projection matrix that are useful: \n\n- $\\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}$ are **idempotent**, which means that when applied to itself, it simply returns itself: $\\mb{P}_{\\Xmat}\\mb{P}_{\\Xmat} = \\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}\\mb{M}_{\\Xmat} = \\mb{M}_{\\Xmat}$. \n\n- $\\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}$ are symmetric $n \\times n$ matrices so that $\\mb{P}_{\\Xmat}' = \\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}' = \\mb{M}_{\\Xmat}$.\n\n- The rank of $\\mb{P}_{\\Xmat}$ is $k+1$ (the number of columns of $\\Xmat$) and the rank of $\\mb{M}_{\\Xmat}$ is $n - k - 1$. \n\nWe can use the projection and annihilator matrices to arrive at an orthogonal decomposition of the outcome vector:\n$$ \n\\mb{Y} = \\Xmat\\bhat + \\widehat{\\mb{e}} = \\mb{P}_{\\Xmat}\\mb{Y} + \\mb{M}_{\\Xmat}\\mb{Y}.\n$$\n \n\n\n::: {.content-hidden}\n\n## Trace of a matrix\n\nRecall that the trace of a $k \\times k$ square matrix, $\\mb{A} = {a_{ij}}$, is sum the sum of the diagonal entries,\n$$\n\\text{trace}(\\mb{A}) = \\sum_{i=1}^{k} a_{ii},\n$$\nso, for example, $\\text{trace}(\\mb{I}_{n}) = n$. A couple of key properties of the trace:\n\n- Trace is linear: $\\text{trace}(k\\mb{A}) = k\\; \\text{trace}(\\mb{a})$ and $\\text{trace}(\\mb{A} + \\mb{B}) = \\text{trace}(\\mb{A}) + \\text{trace}(\\mb{B})$\n- Trace is invariant to multiplication direction: $\\text{trace}(\\mb{AB}) = \\text{trace}(\\mb{BA})$. \n:::\n\n\n## Residual regression\n\nThere are many situations where we can partition the covariates into two groups, and we might wonder if it is possible how to express or calculate the OLS coefficients for just one set of covariates. 
In particular, let the columns of $\\Xmat$ be partitioned into $[\\Xmat_{1} \\Xmat_{2}]$, so that linear prediction we are estimating is \n$$ \n\\mb{Y} = \\Xmat_{1}\\bfbeta_{1} + \\Xmat_{2}\\bfbeta_{2} + \\mb{e}, \n$$\nwith estimated coefficients and residuals\n$$ \n\\mb{Y} = \\Xmat_{1}\\bhat_{1} + \\Xmat_{2}\\bhat_{2} + \\widehat{\\mb{e}}.\n$$\n\nWe now document another way to obtain the estimator $\\bhat_1$ from this regression using a technique called **residual regression**, **partitioned regression**, or the **Frisch-Waugh-Lovell theorem**.\n \n::: {.callout-note}\n\n## Residual regression approach \n\nThe residual regression approach is:\n\n1. Use OLS to regress $\\mb{Y}$ on $\\Xmat_2$ and obtain residuals $\\widetilde{\\mb{e}}_2$. \n2. Use OLS to regress each column of $\\Xmat_1$ on $\\Xmat_2$ and obtain residuals $\\widetilde{\\Xmat}_1$.\n3. Use OLS to regression $\\widetilde{\\mb{e}}_{2}$ on $\\widetilde{\\Xmat}_1$. \n\n:::\n\n::: {#thm-fwl}\n\n## Frisch-Waugh-Lovell\n\nThe OLS coefficients from a regression of $\\widetilde{\\mb{e}}_{2}$ on $\\widetilde{\\Xmat}_1$ are equivalent to the coefficients on $\\Xmat_{1}$ from the regression of $\\mb{Y}$ on both $\\Xmat_{1}$ and $\\Xmat_2$. \n\n:::\n\nOne implication of this theorem is the regression coefficient for a given variable captures the relationship between the residual variation in the outcome and that variable after accounting for the other covariates. In particular, this coefficient focuses on the variation orthogonal to those other covariates. \n\nWhile perhaps unexpected, this result may not appear particularly useful. We can just run the long regression, right? This trick can be handy when $\\Xmat_2$ consists of dummy variables (or \"fixed effects\") for a categorical variable with many categories. For example, suppose $\\Xmat_2$ consists of indicators for the county of residence for a respondent. In that case, that will have over 3,000 columns, meaning that direct calculation of the $\\bhat = (\\bhat_{1}, \\bhat_{2})$ will require inverting a matrix that is bigger than $3,000 \\times 3,000$. Computationally, this process will be very slow. But above, we saw that predictions of an outcome on a categorical variable are just the sample mean within each level of the variable. Thus, in this case, the residuals $\\widetilde{\\mb{e}}_2$ and $\\Xmat_1$ can be computed by demeaning the outcome and $\\Xmat_1$ within levels of the dummies in $\\Xmat_2$, which can be considerably faster computationally. \n\nFinally, there are data visualization reasons to use residual regression. It is often difficult to see if the linear functional form for some covariate is appropriate once you begin to control for other variables. One can check the relationship using this approach with a scatterplot of $\\widetilde{\\mb{e}}_2$ on $\\Xmat_1$ (when it is a single column). \n\n\n## Outliers, leverage points, and influential observations\n\nGiven that OLS finds the coefficients that minimize the sum of the squared residuals, it is helpful to ask how much impact each residual has on that solution. Let $\\bhat_{(-i)}$ be the OLS estimates if we omit unit $i$. Intuitively, **influential observations** should significantly impact the estimated coefficients so that $\\bhat_{(-i)} - \\bhat$ is large in absolute value. \n\nUnder what conditions will we have influential observations? OLS tries to minimize the sum of **squared** residuals, so it will move more to shrink larger residuals than smaller ones. Where are large residuals likely to occur? 
Well, notice that any OLS regression line with a constant will go through the means of the outcome and the covariates: $\\overline{Y} = \\overline{\\X}\\bhat$. Thus, by definition, This means that when an observation is close to the average of the covariates, $\\overline{\\X}$, it cannot have that much influence because OLS forces the regression line to go through $\\overline{Y}$. Thus, we should look for influential points that have two properties:\n\n1. Have high **leverage**, where leverage roughly measures how far $\\X_i$ is from $\\overline{\\X}$, and\n2. Be an **outlier** in the sense of having a large residual (if left out of the regression).\n\nWe'll take each of these in turn. \n\n### Leverage points {#sec-leverage}\n\nWe can define the **leverage** of an observation by\n$$ \nh_{ii} = \\X_{i}'\\left(\\Xmat'\\Xmat\\right)^{-1}\\X_{i},\n$$\nwhich is the $i$th diagonal entry of the projection matrix, $\\mb{P}_{\\Xmat}$. Notice that \n$$ \n\\widehat{\\mb{Y}} = \\mb{P}\\mb{Y} \\qquad \\implies \\qquad \\widehat{Y}_i = \\sum_{j=1}^n h_{ij}Y_j,\n$$\nso that $h_{ij}$ is the importance of observation $j$ for the fitted value for observation $i$. The leverage, then, is the importance of the observation for its own fitted value. We can also interpret these values in terms of the distribution of $\\X_{i}$. Roughly speaking, these values are the weighted distance $\\X_i$ is from $\\overline{\\X}$, where the weights normalize to the empirical variance/covariance structure of the covariates (so that the scale of each covariate is roughly the same). We can see this most clearly when we fit a simple linear regression (with one covariate and an intercept) with OLS when the leverage is\n$$ \nh_{ii} = \\frac{1}{n} + \\frac{(X_i - \\overline{X})^2}{\\sum_{j=1}^n (X_j - \\overline{X})^2}\n$$\n\nLeverage values have three key properties:\n\n1. $0 \\leq h_{ii} \\leq 1$\n2. $h_{ii} \\geq 1/n$ if the model contains an intercept\n2. $\\sum_{i=1}^{n} h_{ii} = k + 1$\n\n### Outliers and leave-one-out regression\n\nIn the context of OLS, an **outlier** is an observation with a large prediction error for a particular OLS specification. @fig-outlier shows an example of an outlier. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of an outlier.](07_least_squares_files/figure-pdf/fig-outlier-1.pdf){#fig-outlier}\n:::\n:::\n\n\n\nIntuitively, it seems as though we could use the residual $\\widehat{e}_i$ to assess the prediction error for a given unit. But the residuals are not valid predictions because the OLS estimator is designed to make those as small as possible (in machine learning parlance, these were in the training set). In particular, if an outlier is influential, we already noted that it might \"pull\" the regression line toward it, and the resulting residual might be pretty small. \n\nTo assess prediction errors more cleanly, we can use **leave-one-out regression** (LOO), which regresses$\\mb{Y}_{(-i)}$ on $\\Xmat_{(-i)}$, where these omit unit $i$:\n$$ \n\\bhat_{(-i)} = \\left(\\Xmat'_{(-i)}\\Xmat_{(-i)}\\right)^{-1}\\Xmat_{(-i)}\\mb{Y}_{(-i)}.\n$$\nWe can then calculate LOO prediction errors as\n$$ \n\\widetilde{e}_{i} = Y_{i} - \\X_{i}'\\bhat_{(-i)}.\n$$\nCalculating these LOO prediction errors for each unit appears to be computationally costly because it seems as though we have to fit OLS $n$ times. 
Fortunately, there is a closed-form expression for the LOO coefficients and prediction errors in terms of the original regression, \n$$ \n\\bhat_{(-i)} = \\bhat - \\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i \\qquad \\widetilde{e}_i = \\frac{\\widehat{e}_i}{1 - h_{ii}}.\n$$ {#eq-loo-coefs}\nWe can see from this that the LOO prediction errors will differ from the residuals when the leverage of a unit is high. This makes sense! We said earlier that observations with low leverage would be close to $\\overline{\\X}$, where the outcome values have relatively little impact on the OLS fit (because the regression line must go through $\\overline{Y}$). \n\n### Influence points\n\nAn influence point is an observation that has the power to change the coefficients and fitted values for a particular OLS specification. @fig-influence shows an example of such an influence point. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of an influence point.](07_least_squares_files/figure-pdf/fig-influence-1.pdf){#fig-influence}\n:::\n:::\n\n\n\nOne measure of influence is called DFBETA$_i$ measures how much $i$ changes the estimated coefficient vector\n$$ \n\\bhat - \\bhat_{(-i)} = \\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i,\n$$\nso there is one value for each observation-covariate pair. When divided by the standard error of the estimated coefficients, this is called DFBETA**S** (where the \"S\" is for standardized). These are helpful if we focus on a particular coefficient. \n\n\nWhen we want to summarize how much an observation matters for the fit, we can use a compact measure of the influence of an observation by comparing the fitted value from the entire sample to the fitted value from the leave-one-out regression. Using the DFBETA above, we have\n$$ \n\\widehat{Y}_i - \\X_{i}\\bhat_{(-1)} = \\X_{i}'(\\bhat -\\bhat_{(-1)}) = \\X_{i}'\\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i = h_{ii}\\widetilde{e}_i,\n$$\nso the influence of an observation is its leverage times how much of an outlier it is. This value is sometimes called DFFIT (difference in fit). One transformation of this quantity, **Cook's distance**, standardizes this by the sum of the squared residuals:\n$$ \nD_i = \\frac{n-k-1}{k+1}\\frac{h_{ii}\\widetilde{e}_{i}^{2}}{\\widehat{\\mb{e}}'\\widehat{\\mb{e}}}.\n$$\nVarious rules exist for establishing cutoffs for identifying an observation as \"influential\" based on these metrics, but they tend to be ad hoc. In any case, it's better to focus on the holistic question of \"how much does this observation matter for my substantive interpretation\" rather than the narrow question of a particular threshold. \n\n\nIt's all well and good to find influential points, but what should you do about it? The first thing to check is that the data is not corrupted somehow. Sometimes influence points occur because of a coding or data entry error. If you have control over that coding, you should fix those errors. You may consider removing the observation if the error appears in the data acquired from another source. Still, when writing up your analyses, you should be extremely clear about this choice. Another approach is to consider a transformation of the dependent or independent variables, like the natural logarithm, that might dampen the effects of outliers. Finally, consider using methods that are robust to outliers. 
\n", + "markdown": "# The mechanics of least squares {#sec-ols-mechanics}\n\nThis chapter explores the most widely used estimator for population linear regressions: **ordinary least squares** (OLS). OLS is a plug-in estimator for the best linear projection (or population linear regression) described in the last chapter. Its popularity is partly due to its ease of interpretation, computational simplicity, and statistical efficiency. \n\nIn this chapter, we focus on motivating the estimator and the mechanical or algebraic properties of the OLS estimator. In the next chapter, we will investigate its statistical assumptions. Textbooks often introduce OLS under an assumption of a linear model for the conditional expectation, but this is unnecessary if we view the inference target as the best linear predictor. We discuss this point more fully in the next chapter. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Relationship between political institutions and economic development from Acemoglu, Johnson, and Robinson (2001).](07_least_squares_files/figure-pdf/fig-ajr-scatter-1.pdf){#fig-ajr-scatter}\n:::\n:::\n\n\n\n\n\n## Deriving the OLS estimator \n\nIn the last chapter on the linear model and the best linear projection, we operated purely in the population, not samples. We derived the population regression coefficients $\\bfbeta$, representing the coefficients on the line of best fit in the population. We now take these as our quantity of interest. \n\n::: {.callout-note}\n## Assumption\n\n\nThe variables $\\{(Y_1, \\X_1), \\ldots, (Y_i,\\X_i), \\ldots, (Y_n, \\X_n)\\}$ are i.i.d. draws from a common distribution $F$.\n\n:::\n\nRecall the population linear coefficients (or best linear predictor coefficients) that we derived in the last chapter,\n$$ \n\\bfbeta = \\argmin_{\\mb{b} \\in \\real^k}\\; \\E\\bigl[ \\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\\bigr] = \\left(\\E[\\X_{i}\\X_{i}']\\right)^{-1}\\E[\\X_{i}Y_{i}]\n$$\n\nWe will consider two different ways to derive the OLS estimator for these coefficients, both of which are versions of the plug-in principle. The first approach is to use the closed-form representation of the coefficients and replace any expectations with sample means,\n$$ \n\\bhat = \\left(\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\frac{1}{n} \\sum_{i=1}^n \\X_{i}Y_{i} \\right),\n$$\nwhich exists if $\\sum_{i=1}^n \\X_i\\X_i'$ is **positive definite** and thus invertible. We will return to this assumption below. \n\n\nIn a simple bivariate linear projection model $m(X_{i}) = \\beta_0 + \\beta_1X_{i}$, we saw that the population slope was $\\beta_1= \\text{cov}(Y_{i},X_{i})/ \\V[X_{i}]$ and this approach would have our estimator for the slope be the ratio of the sample covariance of $Y_i$ and $X_i$ to the sample variance of $X_i$, or\n$$ \n\\widehat{\\beta}_{1} = \\frac{\\widehat{\\sigma}_{Y,X}}{\\widehat{\\sigma}^{2}_{X}} = \\frac{ \\frac{1}{n-1}\\sum_{i=1}^{n} (Y_{i} - \\overline{Y})(X_{i} - \\overline{X})}{\\frac{1}{n-1} \\sum_{i=1}^{n} (X_{i} - \\Xbar)^{2}}.\n$$\n\nThis plug-in approach is widely applicable and tends to have excellent properties in large samples under iid data. But this approach also hides some of the geometry of the setting. \n\nThe second approach applies the plug-in principle not to the closed-form expression for the coefficients but to the optimization problem itself. 
We call this the **least squares** estimator because it minimizes the empirical (or sample) squared prediction error,\n$$ \n\\bhat = \\argmin_{\\mb{b} \\in \\real^k}\\; \\frac{1}{n} \\sum_{i=1}^{n}\\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2 = \\argmin_{\\mb{b} \\in \\real^k}\\; SSR(\\mb{b}),\n$$\nwhere\n$$ \nSSR(\\mb{b}) = \\sum_{i=1}^{n}\\bigl(Y_{i} - \\mb{X}_{i}'\\mb{b} \\bigr)^2\n$$\nis the sum of the squared residuals. To distinguish it from other, more complicated least squares estimators, we call this the **ordinary least squares** estimator or OLS. \n\nLet's solve this minimization problem! We can write down the first-order conditions as\n$$ \n0=\\frac{\\partial SSR(\\bhat)}{\\partial \\bfbeta} = 2 \\left(\\sum_{i=1}^{n} \\X_{i}Y_{i}\\right) - 2\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right)\\bhat.\n$$\nWe can rearrange this system of equations to\n$$ \n\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right)\\bhat = \\left(\\sum_{i=1}^{n} \\X_{i}Y_{i}\\right).\n$$\nTo obtain the solution for $\\bhat$, notice that $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is a $(k+1) \\times (k+1)$ matrix and $\\bhat$ and $\\sum_{i=1}^{n} \\X_{i}Y_{i}$ are both $k+1$ length column vectors. If $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is invertible, then we can multiply both sides of this equation by that inverse to arrive at\n$$ \n\\bhat = \\left(\\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\sum_{i=1}^n \\X_{i}Y_{i} \\right),\n$$\nwhich is the same expression as the plug-in estimator (after canceling the $1/n$ terms). To confirm that we have found a minimum, we also need to check the second-order condition, \n$$ \n \\frac{\\partial^{2} SSR(\\bhat)}{\\partial \\bfbeta\\bfbeta'} = 2\\left(\\sum_{i=1}^{n}\\X_{i}\\X_{i}'\\right) > 0.\n$$\nWhat does it mean for a matrix to be \"positive\"? In matrix algebra, this condition means that the matrix $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is **positive definite**, a condition that we discuss in @sec-rank. \n\n\nUsing the plug-in or least squares approaches, we arrive at the same estimator for the best linear predictor/population linear regression coefficients.\n\n::: {#thm-ols}\n\nIf $\\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is positive definite, then the ordinary least squares estimator is\n$$\n\\bhat = \\left(\\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left(\\sum_{i=1}^n \\X_{i}Y_{i} \\right).\n$$\n\n:::\n\n\n::: {.callout-note}\n\n## Formula for the OLS slopes\n\nAlmost all regressions will contain an intercept term, usually represented as a constant 1 in the covariate vector. It is also possible to separate the intercept to arrive at the set of coefficients on the \"real\" covariates:\n$$ \nY_{i} = \\alpha + \\X_{i}'\\bfbeta + \\e_{i}.\n$$\nDefined this way, we can write the OLS estimator for the \"slopes\" on $\\X_i$ as the OLS estimator with all variables demeaned\n$$ \n\\bhat = \\left(\\frac{1}{n} \\sum_{i=1}^{n} (\\X_{i} - \\overline{\\X})(\\X_{i} - \\overline{\\X})'\\right)^{-1} \\left(\\frac{1}{n} \\sum_{i=1}^{n}(\\X_{i} - \\overline{\\X})(Y_{i} - \\overline{Y})\\right),\n$$\nwhich is the inverse of the sample covariance matrix of $\\X_i$ times the sample covariance of $\\X_i$ and $Y_i$. The intercept is \n$$ \n\\widehat{\\alpha} = \\overline{Y} - \\overline{\\X}'\\bhat.\n$$\n\n:::\n\nWhen dealing with actual data, we refer to the prediction errors $\\widehat{e}_{i} = Y_i - \\X_i'\\bhat$ as the **residuals** and the predicted value itself, $\\widehat{Y}_i = \\X_{i}'\\bhat$, is also called the **fitted value**. With the population linear regression, we saw that the projection errors $e_i = Y_i - \\X_i'\\bfbeta$ were mean zero and uncorrelated with the covariates, $\\E[\\X_{i}e_{i}] = 0$. The residuals have a similar property with respect to the covariates in the sample:\n$$ \n\\sum_{i=1}^n \\X_i\\widehat{e}_i = 0.\n$$\nThe residuals are *exactly* uncorrelated with the covariates (when the covariates include a constant/intercept term), which is mechanically true of the OLS estimator. 
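A minimal simulated sketch of these formulas (hypothetical data; the sums are computed with `crossprod()`):

```r
# OLS by hand and the mechanical orthogonality of residuals and covariates
set.seed(99)
n <- 200
X <- cbind(1, rnorm(n), runif(n))              # covariate matrix including the constant
Y <- drop(X %*% c(1, 2, -3)) + rnorm(n)

# (sum of X_i X_i')^{-1} (sum of X_i Y_i)
bhat <- drop(solve(crossprod(X), crossprod(X, Y)))
bhat
coef(lm(Y ~ X[, -1]))                          # lm() gives the same estimates

ehat <- Y - drop(X %*% bhat)                   # residuals
crossprod(X, ehat)                             # sum of X_i * ehat_i: zero up to rounding
```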
With the population linear regression, we saw that the projection errors $e_i = Y_i - \\X_i'\\bfbeta$ were mean zero and uncorrelated with the covariates, $\\E[\\X_{i}e_{i}] = 0$. The residuals have a similar property with respect to the covariates in the sample:\n$$ \n\\sum_{i=1}^n \\X_i\\widehat{e}_i = 0.\n$$\nThe residuals are *exactly* uncorrelated with the covariates (when the covariates include a constant/intercept term), which is mechanically true of the OLS estimator. \n\n\n@fig-ssr-comp shows how OLS works in the bivariate case. Here we see three possible regression lines and the sum of the squared residuals for each line. OLS aims to find the line that minimizes the function on the right. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Different possible lines and their corresponding sum of squared residuals.](07_least_squares_files/figure-pdf/fig-ssr-comp-1.pdf){#fig-ssr-comp}\n:::\n:::\n\n\n\n## Model fit\n\nWe have learned how to use OLS to obtain an estimate of the best linear predictor, but we may ask how good that prediction is. Does using $\\X_i$ help us predict $Y_i$? To investigate this, we can consider two different prediction errors: those using covariates and those that do not. \n\nWe have already seen the prediction error when using the covariates; it is just the **sum of the squared residuals**,\n$$ \nSSR = \\sum_{i=1}^n (Y_i - \\X_{i}'\\bhat)^2.\n$$\nRecall that the best predictor for $Y_i$ without any covariates is simply its sample mean $\\overline{Y}$, and so the prediction error without covariates is what we call the **total sum of squares**,\n$$ \nTSS = \\sum_{i=1}^n (Y_i - \\overline{Y})^2.\n$$\n@fig-ssr-vs-tss shows the difference between these two types of prediction errors. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Total sum of squares vs. the sum of squared residuals.](07_least_squares_files/figure-pdf/fig-ssr-vs-tss-1.pdf){#fig-ssr-vs-tss}\n:::\n:::\n\n\n\nWe can use the **proportion reduction in prediction error** from adding those covariates to measure how much those covariates improve the regression's predictive ability. This value, called the **coefficient of determination** or $R^2$, is simply\n$$\nR^2 = \\frac{TSS - SSR}{TSS} = 1-\\frac{SSR}{TSS},\n$$\nwhich is the reduction in error moving from $\\overline{Y}$ to $\\X_i'\\bhat$ as the predictor relative to the prediction error using $\\overline{Y}$. We can think of this as the fraction of the total prediction error eliminated by using $\\X_i$ to predict $Y_i$. One thing to note is that OLS will *always* improve in-sample fit so that $TSS \\geq SSR$ even if $\\X_i$ is unrelated to $Y_i$. This phantom improvement occurs because the whole point of OLS is to minimize the SSR, and it will do that even if it is just chasing noise. \n\nSince regression always improves in-sample fit, $R^2$ will fall between 0 and 1. A value of 0 would indicate estimated coefficients of exactly 0 on all covariates (except the intercept), so that $Y_i$ and $\\X_i$ are perfectly orthogonal in the data (this is very unlikely to occur because there will likely be some minimal but nonzero relationship by random chance). A value of 1 indicates a perfect linear fit. \n\n## Matrix form of OLS\n\nWhile we derived the OLS estimator above, there is a much more common representation of the estimator that relies on vectors and matrices. 
We usually write the linear model for a generic unit, $Y_i = \\X_i'\\bfbeta + e_i$, but obviously, there are $n$ of these equations,\n$$ \n\\begin{aligned}\n Y_1 &= \\X_1'\\bfbeta + e_1 \\\\\n Y_2 &= \\X_2'\\bfbeta + e_2 \\\\\n &\\vdots \\\\\n Y_n &= \\X_n'\\bfbeta + e_n \\\\\n\\end{aligned}\n$$\nWe can write this system of equations in a more compact form using matrix algebra. In particular, let's combine the variables here into random vectors/matrices:\n$$\n\\mb{Y} = \\begin{pmatrix}\nY_1 \\\\ Y_2 \\\\ \\vdots \\\\ Y_n\n \\end{pmatrix}, \\quad\n \\mathbb{X} = \\begin{pmatrix}\n\\X'_1 \\\\\n\\X'_2 \\\\\n\\vdots \\\\\n\\X'_n\n \\end{pmatrix} =\n \\begin{pmatrix}\n 1 & X_{11} & X_{12} & \\cdots & X_{1k} \\\\\n 1 & X_{21} & X_{22} & \\cdots & X_{2k} \\\\\n \\vdots & \\vdots & \\vdots & \\vdots & \\vdots \\\\\n 1 & X_{n1} & X_{n2} & \\cdots & X_{nk} \\\\\n \\end{pmatrix},\n \\quad\n \\mb{e} = \\begin{pmatrix}\ne_1 \\\\ e_2 \\\\ \\vdots \\\\ e_n\n \\end{pmatrix}\n$$\nThen we can write the above system of equations as\n$$\n\\mb{Y} = \\mathbb{X}\\bfbeta + \\mb{e},\n$$\nwhere notice now that $\\mathbb{X}$ is an $n \\times (k+1)$ matrix and $\\bfbeta$ is a $k+1$ length column vector. \n\nA critical link between the definition of OLS above and the matrix notation comes from representing sums in matrix form. In particular, we have\n$$\n\\begin{aligned}\n \\sum_{i=1}^n \\X_i\\X_i' &= \\Xmat'\\Xmat \\\\\n \\sum_{i=1}^n \\X_iY_i &= \\Xmat'\\mb{Y},\n\\end{aligned}\n$$\nwhich means we can write the OLS estimator in the more recognizable form as \n$$ \n\\bhat = \\left( \\mathbb{X}'\\mathbb{X} \\right)^{-1} \\mathbb{X}'\\mb{Y}.\n$$\n\nOf course, we can also define the vector of residuals,\n$$ \n \\widehat{\\mb{e}} = \\mb{Y} - \\mathbb{X}\\bhat = \\left[\n\\begin{array}{c}\n Y_1 \\\\\n Y_2 \\\\\n \\vdots \\\\\n Y_n\n \\end{array}\n\\right] - \n\\left[\n\\begin{array}{c}\n 1\\widehat{\\beta}_0 + X_{11}\\widehat{\\beta}_1 + X_{12}\\widehat{\\beta}_2 + \\dots + X_{1k}\\widehat{\\beta}_k \\\\\n 1\\widehat{\\beta}_0 + X_{21}\\widehat{\\beta}_1 + X_{22}\\widehat{\\beta}_2 + \\dots + X_{2k}\\widehat{\\beta}_k \\\\\n \\vdots \\\\\n 1\\widehat{\\beta}_0 + X_{n1}\\widehat{\\beta}_1 + X_{n2}\\widehat{\\beta}_2 + \\dots + X_{nk}\\widehat{\\beta}_k\n\\end{array}\n\\right],\n$$\nand so the sum of the squared residuals, in this case, becomes\n$$ \nSSR(\\bfbeta) = \\Vert\\mb{Y} - \\mathbb{X}\\bfbeta\\Vert^{2} = (\\mb{Y} - \\mathbb{X}\\bfbeta)'(\\mb{Y} - \\mathbb{X}\\bfbeta),\n$$\nwhere the double vertical lines mean the Euclidean norm of the argument, $\\Vert \\mb{z} \\Vert = \\sqrt{\\sum_{i=1}^n z_i^{2}}$. The OLS minimization problem, then, is \n$$ \n\\bhat = \\argmin_{\\mb{b} \\in \\mathbb{R}^{(k+1)}}\\; \\Vert\\mb{Y} - \\mathbb{X}\\mb{b}\\Vert^{2}.\n$$\nFinally, we can write the orthogonality of the covariates and the residuals as\n$$ \n\\mathbb{X}'\\widehat{\\mb{e}} = \\sum_{i=1}^{n} \\X_{i}\\widehat{e}_{i} = 0.\n$$\n\n## Rank, linear independence, and multicollinearity {#sec-rank}\n\nWhen introducing the OLS estimator, we noted that it would exist when $\\sum_{i=1}^n \\X_i\\X_i'$ is positive definite or that there is \"no multicollinearity.\" This assumption is equivalent to saying that the matrix $\\mathbb{X}$ is full column rank, meaning that $\\text{rank}(\\mathbb{X}) = (k+1)$, where $k+1$ is the number of columns of $\\mathbb{X}$. 
Recall from matrix algebra that the column rank is the number of linearly independent columns in the matrix, and **linear independence** means that $\\mathbb{X}\\mb{b} = 0$ if and only if $\\mb{b}$ is a column vector of 0s. In other words, we have\n$$ \nb_{1}\\mathbb{X}_{1} + b_{2}\\mathbb{X}_{2} + \\cdots + b_{k+1}\\mathbb{X}_{k+1} = 0 \\quad\\iff\\quad b_{1} = b_{2} = \\cdots = b_{k+1} = 0, \n$$\nwhere $\\mathbb{X}_j$ is the $j$th column of $\\mathbb{X}$. Thus, full column rank says that all the columns are linearly independent or that there is no \"multicollinearity.\"\n\nHow could this be violated? Suppose we accidentally included a linear function of one variable so that $\\mathbb{X}_2 = 2\\mathbb{X}_1$. Then we have\n$$ \n\\begin{aligned}\n \\mathbb{X}\\mb{b} &= b_{1}\\mathbb{X}_{1} + b_{2}2\\mathbb{X}_1+ b_{3}\\mathbb{X}_{3}+ \\cdots + b_{k+1}\\mathbb{X}_{k+1} \\\\\n &= (b_{1} + 2b_{2})\\mathbb{X}_{1} + b_{3}\\mathbb{X}_{3} + \\cdots + b_{k+1}\\mathbb{X}_{k+1}\n\\end{aligned}\n$$\nIn this case, this expression equals 0 when $b_3 = b_4 = \\cdots = b_{k+1} = 0$ and $b_1 = -2b_2$. Thus, the collection of columns is linearly dependent, so we know that the rank of $\\mathbb{X}$ must be less than full column rank (that is, less than $k+1$). Hopefully, it is also clear that if we removed the problematic column $\\mathbb{X}_2$, the resulting matrix would have $k$ linearly independent columns, implying that $\\mathbb{X}$ is rank $k$. \n\nWhy does this rank condition matter for the OLS estimator? A key property of full column rank matrices is that $\\Xmat$ is of full column rank if and only if $\\Xmat'\\Xmat$ is non-singular, and a matrix is invertible if and only if it is non-singular. Thus, the columns of $\\Xmat$ being linearly independent means that the inverse $(\\Xmat'\\Xmat)^{-1}$ exists and so does $\\bhat$. Further, this full rank condition also implies that $\\Xmat'\\Xmat = \\sum_{i=1}^{n}\\X_{i}\\X_{i}'$ is positive definite, implying that the estimator is truly finding the minimal sum of squared residuals.\n\nWhat are common situations that lead to violations of no multicollinearity? We have seen one above, with one variable being a linear function of another. But this problem can arise in more subtle ways. Suppose that we have a set of dummy variables corresponding to a single categorical variable, like the region of the country. In the US, this might mean we have $X_{i1} = 1$ for units in the West (0 otherwise), $X_{i2} = 1$ for units in the Midwest (0 otherwise), $X_{i3} = 1$ for units in the South (0 otherwise), and $X_{i4} = 1$ for units in the Northeast (0 otherwise). Each unit has to be in one of these four regions, so there is a linear dependence between these variables, \n$$ \nX_{i4} = 1 - X_{i1} - X_{i2} - X_{i3}.\n$$\nThat is, if I know that you are not in the West, Midwest, or South regions, I know that you are in the Northeast. We would get a linear dependence if we tried to include all of these variables in our regression with an intercept. (Note the 1 in the relationship between $X_{i4}$ and the other variables; that is why there will be linear dependence when including a constant.) Thus, we usually omit one dummy variable from each categorical variable. In that case, the coefficients on the remaining dummies are differences in means between that category and the omitted one (perhaps conditional on other covariates, if any are included). 
So if we omitted $X_{i4}$, then the coefficient on $X_{i1}$ would be the difference in mean outcomes between units in the West and Northeast regions. \n\nAnother way collinearity can occur is if you include both an intercept term and a variable that does not vary. This issue can often happen if we mistakenly subset our data to, say, the West region but still include the West dummy variable in the regression. \n\nFinally, note that most statistical software packages will \"solve\" the multicollinearity by arbitrarily removing as many linearly dependent covariates as is necessary to achieve full rank. R will show the estimated coefficients as `NA` in those cases. \n\n## OLS coefficients for binary and categorical regressors\n\nSuppose that the covariates include just the intercept and a single binary variable, $\\X_i = (1\\; X_{i})'$, where $X_i \\in \\{0,1\\}$. In this case, the OLS coefficient on $X_i$, $\\widehat{\\beta}_{1}$, is exactly equal to the difference in sample means of $Y_i$ between the $X_i = 1$ group and the $X_i = 0$ group:\n$$ \n\\widehat{\\beta}_{1} = \\frac{\\sum_{i=1}^{n} X_{i}Y_{i}}{\\sum_{i=1}^{n} X_{i}} - \\frac{\\sum_{i=1}^{n} (1 - X_{i})Y_{i}}{\\sum_{i=1}^{n} (1 - X_{i})} = \\overline{Y}_{X=1} - \\overline{Y}_{X=0}.\n$$\nThis result is not an approximation. It holds exactly for any sample size. \n\nWe can generalize this idea to discrete variables more broadly. Suppose we have our region variables from the last section and include in our covariates a constant and the dummies for the West, Midwest, and South regions. Then the coefficient on the West dummy will be\n$$ \n\\widehat{\\beta}_{\\text{west}} = \\overline{Y}_{\\text{west}} - \\overline{Y}_{\\text{northeast}},\n$$\nwhich is exactly the difference in sample means of $Y_i$ between the West region and units in the \"omitted region,\" the Northeast. \n\nNote that these interpretations only hold when the regression consists solely of the binary variable or the set of categorical dummy variables. These exact relationships fail when other covariates are added to the model. \n\n\n\n## Projection and geometry of least squares\n\nOLS has a very nice geometric interpretation that can add a lot of intuition for various aspects of the method. In this geometric approach, we view $\\mb{Y}$ as an $n$-dimensional vector in $\\mathbb{R}^n$. As we saw above, OLS in matrix form is about finding the linear combination of the columns of the covariate matrix $\\Xmat$ closest to this vector in terms of the Euclidean distance (which is just the sum of squares). \n\nLet $\\mathcal{C}(\\Xmat) = \\{\\Xmat\\mb{b} : \\mb{b} \\in \\mathbb{R}^{(k+1)}\\}$ be the **column space** of the matrix $\\Xmat$. This set is all linear combinations of the columns of $\\Xmat$ or the set of all possible linear predictions we could obtain from $\\Xmat$. Notice that the OLS fitted values, $\\Xmat\\bhat$, are in this column space. If, as we assume, $\\Xmat$ has full column rank of $k+1$, then the column space $\\mathcal{C}(\\Xmat)$ will be a $(k+1)$-dimensional surface inside of the larger $n$-dimensional space. If $\\Xmat$ has two columns, the column space will be a plane. \n\nAnother interpretation of the OLS estimator is that it finds the linear predictor as the closest point in the column space of $\\Xmat$ to the outcome vector $\\mb{Y}$. This is called the **projection** of $\\mb{Y}$ onto $\\mathcal{C}(\\Xmat)$. @fig-projection shows this projection for a case with $n=3$ and 2 columns in $\\Xmat$. 
The shaded blue region represents the plane of the column space of $\\Xmat$, and we can see that $\\Xmat\\bhat$ is the closest point to $\\mb{Y}$ in that space. That's the whole idea of the OLS estimator: find the linear combination of the columns of $\\Xmat$ (a point in the column space) that minimizes the Euclidean distance between that point and the outcome vector (the sum of squared residuals).\n\n![Projection of Y on the column space of the covariates.](assets/img/projection-drawing.png){#fig-projection}\n\nThis figure shows that the residual vector, which is the difference between the $\\mb{Y}$ vector and the projection $\\Xmat\\bhat$, is perpendicular or orthogonal to the column space of $\\Xmat$. This orthogonality is a consequence of the residuals being orthogonal to all the columns of $\\Xmat$,\n$$ \n\\Xmat'\\widehat{\\mb{e}} = 0,\n$$\nas we established above. Being orthogonal to all the columns means the residual vector will also be orthogonal to all linear combinations of the columns. \n\n## Projection and annihilator matrices\n\nNow that we have the idea of projection onto the column space of $\\Xmat$, we can define a way to project any vector into that space. The $n\\times n$ **projection matrix,**\n$$\n\\mb{P}_{\\Xmat} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat',\n$$\nprojects a vector into $\\mathcal{C}(\\Xmat)$. In particular, we can see that this gives us the fitted values for $\\mb{Y}$:\n$$ \n\\mb{P}_{\\Xmat}\\mb{Y} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'\\mb{Y} = \\Xmat\\bhat.\n$$\nBecause we sometimes write the linear predictor as $\\widehat{\\mb{Y}} = \\Xmat\\bhat$, the projection matrix is also called the **hat matrix**. With either name, multiplying a vector by $\\mb{P}_{\\Xmat}$ gives the best linear predictor of that vector as a function of $\\Xmat$. Intuitively, any vector that is already a linear combination of the columns of $\\Xmat$ (so is in $\\mathcal{C}(\\Xmat)$) should be unaffected by this projection: the closest point in $\\mathcal{C}(\\Xmat)$ to a point already in $\\mathcal{C}(\\Xmat)$ is itself. We can also see this algebraically for any linear combination $\\Xmat\\mb{c}$,\n$$\n\\mb{P}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'\\Xmat\\mb{c} = \\Xmat\\mb{c},\n$$\nbecause $(\\Xmat'\\Xmat)^{-1} \\Xmat'\\Xmat$ simplifies to the identity matrix. In particular, the projection of $\\Xmat$ onto itself is just itself: $\\mb{P}_{\\Xmat}\\Xmat = \\Xmat$. \n\nThe second matrix related to projection is the **annihilator matrix**, \n$$ \n\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\mb{P}_{\\Xmat},\n$$\nwhich projects any vector into the orthogonal complement to the column space of $\\Xmat$, \n$$\n\\mathcal{C}^{\\perp}(\\Xmat) = \\{\\mb{c} \\in \\mathbb{R}^n\\;:\\; \\Xmat'\\mb{c} = 0 \\}.\n$$\nThis matrix is called the annihilator matrix because if you apply it to any linear combination of $\\Xmat$, you get 0:\n$$ \n\\mb{M}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat\\mb{c} - \\mb{P}_{\\Xmat}\\Xmat\\mb{c} = \\Xmat\\mb{c} - \\Xmat\\mb{c} = 0,\n$$\nand in particular, $\\mb{M}_{\\Xmat}\\Xmat = 0$. Why should we care about this matrix? 
Perhaps a more evocative name might be the **residual maker** since it makes residuals when applied to $\\mb{Y}$,\n$$ \n\\mb{M}_{\\Xmat}\\mb{Y} = (\\mb{I}_{n} - \\mb{P}_{\\Xmat})\\mb{Y} = \\mb{Y} - \\mb{P}_{\\Xmat}\\mb{Y} = \\mb{Y} - \\Xmat\\bhat = \\widehat{\\mb{e}}.\n$$\n\n\n\nThere are several fundamental properties of the projection and annihilator matrices that are useful: \n\n- $\\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}$ are **idempotent**, which means that when applied to themselves, they simply return themselves: $\\mb{P}_{\\Xmat}\\mb{P}_{\\Xmat} = \\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}\\mb{M}_{\\Xmat} = \\mb{M}_{\\Xmat}$. \n\n- $\\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}$ are symmetric $n \\times n$ matrices so that $\\mb{P}_{\\Xmat}' = \\mb{P}_{\\Xmat}$ and $\\mb{M}_{\\Xmat}' = \\mb{M}_{\\Xmat}$.\n\n- The rank of $\\mb{P}_{\\Xmat}$ is $k+1$ (the number of columns of $\\Xmat$) and the rank of $\\mb{M}_{\\Xmat}$ is $n - k - 1$. \n\nWe can use the projection and annihilator matrices to arrive at an orthogonal decomposition of the outcome vector:\n$$ \n\\mb{Y} = \\Xmat\\bhat + \\widehat{\\mb{e}} = \\mb{P}_{\\Xmat}\\mb{Y} + \\mb{M}_{\\Xmat}\\mb{Y}.\n$$\n \n\n\n::: {.content-hidden}\n\n## Trace of a matrix\n\nRecall that the trace of a $k \\times k$ square matrix, $\\mb{A} = \\{a_{ij}\\}$, is the sum of the diagonal entries,\n$$\n\\text{trace}(\\mb{A}) = \\sum_{i=1}^{k} a_{ii},\n$$\nso, for example, $\\text{trace}(\\mb{I}_{n}) = n$. A couple of key properties of the trace:\n\n- Trace is linear: $\\text{trace}(c\\mb{A}) = c\\; \\text{trace}(\\mb{A})$ and $\\text{trace}(\\mb{A} + \\mb{B}) = \\text{trace}(\\mb{A}) + \\text{trace}(\\mb{B})$\n- Trace is invariant to multiplication direction: $\\text{trace}(\\mb{AB}) = \\text{trace}(\\mb{BA})$. \n:::\n\n\n## Residual regression\n\nThere are many situations where we can partition the covariates into two groups, and we might wonder how to express or calculate the OLS coefficients for just one set of covariates. In particular, let the columns of $\\Xmat$ be partitioned into $[\\Xmat_{1} \\Xmat_{2}]$, so that the linear prediction we are estimating is \n$$ \n\\mb{Y} = \\Xmat_{1}\\bfbeta_{1} + \\Xmat_{2}\\bfbeta_{2} + \\mb{e}, \n$$\nwith estimated coefficients and residuals\n$$ \n\\mb{Y} = \\Xmat_{1}\\bhat_{1} + \\Xmat_{2}\\bhat_{2} + \\widehat{\\mb{e}}.\n$$\n\nWe now document another way to obtain the estimator $\\bhat_1$ from this regression using a technique called **residual regression**, **partitioned regression**, or the **Frisch-Waugh-Lovell theorem**.\n \n::: {.callout-note}\n\n## Residual regression approach \n\nThe residual regression approach is:\n\n1. Use OLS to regress $\\mb{Y}$ on $\\Xmat_2$ and obtain residuals $\\widetilde{\\mb{e}}_2$. \n2. Use OLS to regress each column of $\\Xmat_1$ on $\\Xmat_2$ and obtain residuals $\\widetilde{\\Xmat}_1$.\n3. Use OLS to regress $\\widetilde{\\mb{e}}_{2}$ on $\\widetilde{\\Xmat}_1$. \n\n:::\n\n::: {#thm-fwl}\n\n## Frisch-Waugh-Lovell\n\nThe OLS coefficients from a regression of $\\widetilde{\\mb{e}}_{2}$ on $\\widetilde{\\Xmat}_1$ are equivalent to the coefficients on $\\Xmat_{1}$ from the regression of $\\mb{Y}$ on both $\\Xmat_{1}$ and $\\Xmat_2$. \n\n:::\n\nOne implication of this theorem is that the regression coefficient for a given variable captures the relationship between the residual variation in the outcome and that variable after accounting for the other covariates. In particular, this coefficient focuses on the variation orthogonal to those other covariates. 
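As a quick check on this theorem, the following minimal R sketch compares the long regression to the residual regression, using simulated data whose names and coefficients are purely illustrative.\n\n```r\n# Frisch-Waugh-Lovell with simulated data (illustrative only)\nset.seed(42)\nn <- 500\nz <- rnorm(n)                 # covariate to partial out (the X2 role)\nx <- 0.5 * z + rnorm(n)       # covariate of interest (the X1 role)\ny <- 1 + 2 * x - 3 * z + rnorm(n)\n\ncoef(lm(y ~ x + z))[2]        # coefficient on x from the long regression\n\ne_y <- resid(lm(y ~ z))       # step 1: residualize the outcome\ne_x <- resid(lm(x ~ z))       # step 2: residualize the covariate of interest\ncoef(lm(e_y ~ e_x))[2]        # step 3: matches the long-regression coefficient\n```\n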
\n\nWhile perhaps unexpected, this result may not appear particularly useful. We can just run the long regression, right? This trick can be handy when $\\Xmat_2$ consists of dummy variables (or \"fixed effects\") for a categorical variable with many categories. For example, suppose $\\Xmat_2$ consists of indicators for the county of residence for a respondent. In that case, $\\Xmat_2$ will have over 3,000 columns, meaning that direct calculation of $\\bhat = (\\bhat_{1}, \\bhat_{2})$ will require inverting a matrix that is bigger than $3,000 \\times 3,000$. Computationally, this process will be very slow. But above, we saw that predictions of an outcome on a categorical variable are just the sample mean within each level of the variable. Thus, in this case, the residuals $\\widetilde{\\mb{e}}_2$ and $\\widetilde{\\Xmat}_1$ can be computed by demeaning the outcome and $\\Xmat_1$ within levels of the dummies in $\\Xmat_2$, which can be considerably faster computationally. \n\nFinally, there are data visualization reasons to use residual regression. It is often difficult to see if the linear functional form for some covariate is appropriate once you begin to control for other variables. One can check the relationship using this approach with a scatterplot of $\\widetilde{\\mb{e}}_2$ on $\\widetilde{\\Xmat}_1$ (when it is a single column). \n\n\n## Outliers, leverage points, and influential observations\n\nGiven that OLS finds the coefficients that minimize the sum of the squared residuals, it is helpful to ask how much impact each residual has on that solution. Let $\\bhat_{(-i)}$ be the OLS estimates if we omit unit $i$. Intuitively, **influential observations** should significantly impact the estimated coefficients so that $\\bhat_{(-i)} - \\bhat$ is large in absolute value. \n\nUnder what conditions will we have influential observations? OLS tries to minimize the sum of **squared** residuals, so it will move more to shrink larger residuals than smaller ones. Where are large residuals likely to occur? Well, notice that any OLS regression line with a constant will go through the means of the outcome and the covariates: $\\overline{Y} = \\overline{\\X}'\\bhat$. Thus, by definition, this means that when an observation is close to the average of the covariates, $\\overline{\\X}$, it cannot have that much influence because OLS forces the regression line to go through $\\overline{Y}$. Thus, we should look for influential points that have two properties:\n\n1. Have high **leverage**, where leverage roughly measures how far $\\X_i$ is from $\\overline{\\X}$, and\n2. Be an **outlier** in the sense of having a large residual (if left out of the regression).\n\nWe'll take each of these in turn. \n\n### Leverage points {#sec-leverage}\n\nWe can define the **leverage** of an observation by\n$$ \nh_{ii} = \\X_{i}'\\left(\\Xmat'\\Xmat\\right)^{-1}\\X_{i},\n$$\nwhich is the $i$th diagonal entry of the projection matrix, $\\mb{P}_{\\Xmat}$. Notice that \n$$ \n\\widehat{\\mb{Y}} = \\mb{P}_{\\Xmat}\\mb{Y} \\qquad \\implies \\qquad \\widehat{Y}_i = \\sum_{j=1}^n h_{ij}Y_j,\n$$\nso that $h_{ij}$ is the importance of observation $j$ for the fitted value for observation $i$. The leverage, then, is the importance of the observation for its own fitted value. We can also interpret these values in terms of the distribution of $\\X_{i}$. 
Roughly speaking, these values are the weighted distance $\\X_i$ is from $\\overline{\\X}$, where the weights normalize to the empirical variance/covariance structure of the covariates (so that the scale of each covariate is roughly the same). We can see this most clearly when we fit a simple linear regression (with one covariate and an intercept) with OLS, where the leverage is\n$$ \nh_{ii} = \\frac{1}{n} + \\frac{(X_i - \\overline{X})^2}{\\sum_{j=1}^n (X_j - \\overline{X})^2}.\n$$\n\nLeverage values have three key properties:\n\n1. $0 \\leq h_{ii} \\leq 1$\n2. $h_{ii} \\geq 1/n$ if the model contains an intercept\n3. $\\sum_{i=1}^{n} h_{ii} = k + 1$\n\n### Outliers and leave-one-out regression\n\nIn the context of OLS, an **outlier** is an observation with a large prediction error for a particular OLS specification. @fig-outlier shows an example of an outlier. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of an outlier.](07_least_squares_files/figure-pdf/fig-outlier-1.pdf){#fig-outlier}\n:::\n:::\n\n\n\nIntuitively, it seems as though we could use the residual $\\widehat{e}_i$ to assess the prediction error for a given unit. But the residuals are not valid predictions because the OLS estimator is designed to make those as small as possible (in machine learning parlance, these were in the training set). In particular, if an outlier is influential, we already noted that it might \"pull\" the regression line toward it, and the resulting residual might be pretty small. \n\nTo assess prediction errors more cleanly, we can use **leave-one-out regression** (LOO), which regresses $\\mb{Y}_{(-i)}$ on $\\Xmat_{(-i)}$, where these omit unit $i$:\n$$ \n\\bhat_{(-i)} = \\left(\\Xmat'_{(-i)}\\Xmat_{(-i)}\\right)^{-1}\\Xmat'_{(-i)}\\mb{Y}_{(-i)}.\n$$\nWe can then calculate LOO prediction errors as\n$$ \n\\widetilde{e}_{i} = Y_{i} - \\X_{i}'\\bhat_{(-i)}.\n$$\nCalculating these LOO prediction errors for each unit appears to be computationally costly because it seems as though we have to fit OLS $n$ times. Fortunately, there is a closed-form expression for the LOO coefficients and prediction errors in terms of the original regression, \n$$ \n\\bhat_{(-i)} = \\bhat - \\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i \\qquad \\widetilde{e}_i = \\frac{\\widehat{e}_i}{1 - h_{ii}}.\n$$ {#eq-loo-coefs}\nWe can see from this that the LOO prediction errors will differ from the residuals when the leverage of a unit is high. This makes sense! We said earlier that observations with low leverage would be close to $\\overline{\\X}$, where the outcome values have relatively little impact on the OLS fit (because the regression line must go through $\\overline{Y}$). \n\n### Influence points\n\nAn influence point is an observation that has the power to change the coefficients and fitted values for a particular OLS specification. @fig-influence shows an example of such an influence point. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of an influence point.](07_least_squares_files/figure-pdf/fig-influence-1.pdf){#fig-influence}\n:::\n:::\n\n\n\nOne measure of influence, called DFBETA$_i$, measures how much observation $i$ changes the estimated coefficient vector,\n$$ \n\\bhat - \\bhat_{(-i)} = \\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i,\n$$\nso there is one value for each observation-covariate pair. When divided by the standard error of the estimated coefficients, this is called DFBETA**S** (where the \"S\" is for standardized). These are helpful if we focus on a particular coefficient. 
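In R, the leverages, LOO prediction errors, and DFBETAS discussed above are all available from a fitted `lm` object; the sketch below uses the built-in `mtcars` data and an arbitrary model purely for illustration.\n\n```r\n# Leverage, leave-one-out prediction errors, and DFBETAS (illustrative model)\nfit <- lm(mpg ~ wt + hp, data = mtcars)\n\nh <- hatvalues(fit)              # leverages h_ii\nloo_err <- resid(fit) / (1 - h)  # LOO prediction errors via the closed form\nsum(h)                           # sums to k + 1 (here, 3)\n\nhead(dfbetas(fit))               # standardized change in each coefficient\n```\n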
\n\n\nWhen we want to summarize how much an observation matters for the fit, we can use a compact measure of influence that compares the fitted value from the entire sample to the fitted value from the leave-one-out regression. Using the DFBETA above, we have\n$$ \n\\widehat{Y}_i - \\X_{i}'\\bhat_{(-i)} = \\X_{i}'(\\bhat -\\bhat_{(-i)}) = \\X_{i}'\\left( \\Xmat'\\Xmat\\right)^{-1}\\X_i\\widetilde{e}_i = h_{ii}\\widetilde{e}_i,\n$$\nso the influence of an observation is its leverage times how much of an outlier it is. This value is sometimes called DFFIT (difference in fit). One transformation of this quantity, **Cook's distance**, standardizes this by the sum of the squared residuals:\n$$ \nD_i = \\frac{n-k-1}{k+1}\\frac{h_{ii}\\widetilde{e}_{i}^{2}}{\\widehat{\\mb{e}}'\\widehat{\\mb{e}}}.\n$$\nVarious rules exist for establishing cutoffs for identifying an observation as \"influential\" based on these metrics, but they tend to be ad hoc. In any case, it's better to focus on the holistic question of \"how much does this observation matter for my substantive interpretation\" rather than the narrow question of a particular threshold. \n\n\nIt's all well and good to find influential points, but what should you do about them? The first thing to check is that the data is not corrupted somehow. Sometimes influence points occur because of a coding or data entry error. If you have control over that coding, you should fix those errors. If the error appears in data acquired from another source, you may consider removing the observation. Still, when writing up your analyses, you should be extremely transparent about this choice. Another approach is to consider a transformation of the dependent or independent variables, like the natural logarithm, that might dampen the effects of outliers. Finally, consider using methods that are robust to outliers. 
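As a practical note, the influence diagnostics discussed in this section can be computed directly from a fitted `lm` object in R; the minimal sketch below again uses the built-in `mtcars` data with an arbitrary, purely illustrative model.\n\n```r\n# Standard influence diagnostics from a fitted model (illustrative)\nfit <- lm(mpg ~ wt + hp, data = mtcars)\n\ncooks.distance(fit)               # Cook's distance for each observation\ndffits(fit)                       # scaled difference in fit from dropping each row\nsummary(influence.measures(fit))  # combined summary that flags potentially influential rows\n```\n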
\n", "supporting": [ "07_least_squares_files/figure-pdf" ], diff --git a/_freeze/07_least_squares/figure-pdf/fig-ajr-scatter-1.pdf b/_freeze/07_least_squares/figure-pdf/fig-ajr-scatter-1.pdf index a5e8051b12132d93a497f031b3bde5250b39c53c..f89d3aaf2027da942fbb238060bfdd06b809be55 100644 GIT binary patch delta 181 zcmZpSZH(REC8ug=WMF7yWM*ol$))d`pW>2OlB%HLVr67tWNZkR+ng`AgqH`-)ip4k zEUzPuBCy#>=K*89siU)jld-9*o29XlsfnwrtC@wPfr+uJv$?5}p{s?toq`QPC9w*2 cc3j0JiA5z9MX70AhK43)7F?>TuKsRZ0L7Xs-2eap delta 181 zcmZpSZH(REC8ug&U}k7)U}9pd$))d`pW>2OlB%HLVr67tWNZkR+ng`AgqH`-)ip4g zEUzPuBCy#>=K*89xucPbk)^A-rJIG3qq(KIvw^dlnW?LriJ^&;g}I@joq`QPC9w*2 cc3j0JiA5z9MX70AhK43)7F?>TuKsRZ0Lh9g;Q#;t diff --git a/_freeze/07_least_squares/figure-pdf/fig-influence-1.pdf b/_freeze/07_least_squares/figure-pdf/fig-influence-1.pdf index f3a19b7f01eacd4b25742c9ce6dfc8b59dfcee5d..c526f42dc965a2e139a3399fc4b3ae12c9c0e18e 100644 GIT binary patch delta 199 zcmewq{waKexs0lzk%6I+k(sHnCYQc%eu_(CNveW|i!slXBgv6%q(5q%w0{5osA7ljLZ#;UCd3~%#7Sj3{2b{fp*v_*br0_ ft6*oxRa}x-R8motn#N^lU}0>`rK;-c@5TiHf=w{A delta 199 zcmewq{waKexs0lTftjJHfr*KUCYQc%eu_(CNveW|i!slXBgvMoK0O^P23z^ESyXXot<3_+?)(7%?&M$&CFfgT+NK^6l@48 giB+((<0>vmEGnreN=@T3G_Wu>=2BI4^>^a}0Kz0OrT_o{ diff --git a/_freeze/07_least_squares/figure-pdf/fig-outlier-1.pdf b/_freeze/07_least_squares/figure-pdf/fig-outlier-1.pdf index ba688656c6551b036b64910e90994e37de194518..e024909f7a1ebdfc56319099fdc4a28c2b07a557 100644 GIT binary patch delta 175 zcmewr{wsWgy^N}%k%6I+k(sHnCYQc%eu_(CNveW|i2OlB%HLVr67tWNZkR+uWnHj!z%X)ip3x zH!x7grvs>DbEcsylasTXv74)flcAf5iGib~qnV4PrG<;3sk4iLn}MZ~iJgLtB>^Rq IKN!mZ0NuSRz5oCK delta 174 zcmeCl=*ifSuB2*UU}k7)U}9pT$))d`pW>2OlB%HLVr67tWNZkR+uWnHj!z%X)ip3t zH!x7grvs>DbEcsylasNVrK6dlv$?agp{tXVg@LoVrJJ*Xk&%V5tFfh}v7LgAB>^Rq IKN!mZ0NTMSq5uE@ diff --git a/_freeze/07_least_squares/figure-pdf/fig-ssr-vs-tss-1.pdf b/_freeze/07_least_squares/figure-pdf/fig-ssr-vs-tss-1.pdf index f15a979669a7fe25a1816b4c2b1b2381720da31f..c6eadcbe6691737879d153d38d332881d3572819 100644 GIT binary patch delta 183 zcmbO@n{nc7#tl)ns)j}ehDJtarpB6F`o8%oE{P?n3K}j}Mg~U4hH$yf^|niRdEi`K z1JlX+-r^_%n}fU`Fvgo18ko44Sh^UxI69lVnVY*fIXStxn3*`4yIQ!oSz6jD*br0_ ft6*oxRa}x-R8motn#N^lX<}r+rK;-c@5TiHxmqpN delta 183 zcmbO@n{nc7#tl)nss;vThNcE4CMKF(`o8%oE{P?n3K}j}Mg~U4hH$yf^|niRdEi`K z1Cz=6-r^_%n}fU`FvdGtS{OMxy1H5z89JI6xmdWGn3)-ym>9U2nj4rHnHktA*br0_ ft6*oxRa}x-R8motn#N^lX<}r+rK;-c@5TiHhIuT< From 786c070f2f9c66025992f10bc271144fa36c0a56 Mon Sep 17 00:00:00 2001 From: Matt Blackwell Date: Thu, 30 Nov 2023 16:50:24 -0500 Subject: [PATCH 3/3] typos in ch 7 (fixes #46, fixes #47, fixes #48, fixes #49, fixes #50, fixed #51, fixes #52, fixes #53, fixes #54, fixes #55) --- 08_ols_properties.qmd | 22 +++++++++--------- .../execute-results/html.json | 4 ++-- .../execute-results/tex.json | 4 ++-- .../figure-pdf/fig-wald-1.pdf | Bin 54723 -> 54723 bytes 4 files changed, 15 insertions(+), 15 deletions(-) diff --git a/08_ols_properties.qmd b/08_ols_properties.qmd index 5977df7..38f16ea 100644 --- a/08_ols_properties.qmd +++ b/08_ols_properties.qmd @@ -6,7 +6,7 @@ In this chapter, we will focus first on the asymptotic properties of OLS because ## Large-sample properties of OLS -As we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: a consistent estimate of the variance of $\bhat$ and the approximate distribution of $\bhat$ in large samples. Remember that since $\bhat$ is a vector, then the variance of that estimator will actually be a variance-covariance matrix. 
To obtain these two ingredients, we will first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which will include its variance. +As we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: a consistent estimate of the variance of $\bhat$ and the approximate distribution of $\bhat$ in large samples. Remember that since $\bhat$ is a vector, the variance of that estimator will actually be a variance-covariance matrix. To obtain these two ingredients, we will first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which will include its variance. We begin by setting out the assumptions we will need for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\bhat = \E[\X_{i}\X_{i}']^{-1}\E[\X_{i}Y_{i}]$, is well-defined and unique. @@ -17,9 +17,9 @@ We begin by setting out the assumptions we will need for establishing the large- The linear projection model makes the following assumptions: -1. $\{(Y_{i}, \X_{i})\}_{i=1}^n$ are iid random vectors. +1. $\{(Y_{i}, \X_{i})\}_{i=1}^n$ are iid random vectors -2. $\E[Y_{i}^{2}] < \infty$ (finite outcome variance) +2. $\E[Y^{2}_{i}] < \infty$ (finite outcome variance) 3. $\E[\Vert \X_{i}\Vert^{2}] < \infty$ (finite variances and covariances of covariates) @@ -40,7 +40,7 @@ $$ $$ which implies that $$ -\bhat \inprob \beta + \mb{Q}_{\X\X}^{-1}\E[\X_ie_i] = \beta, +\bhat \inprob \bfbeta + \mb{Q}_{\X\X}^{-1}\E[\X_ie_i] = \bfbeta, $$ by the continuous mapping theorem (the inverse is a continuous function). The linear projection assumptions ensure that LLN applies to these sample means and ensure that $\E[\X_{i}\X_{i}']$ is invertible. @@ -281,7 +281,7 @@ If $\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$ $$ t = \frac{\widehat{\beta}_{j} - \beta_{j}}{\widehat{\se}[\widehat{\beta}_{j}]} \indist \N(0,1) $$ -so $t^2$ will converge in distribution to a $\chi^2_1$ (since a $\chi^2_1$ is just one standard normal squared). After recentering ad rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\mb{L}\bhat = \mb{c}$, we have $W \indist \chi^2_{q}$. +so $t^2$ will converge in distribution to a $\chi^2_1$ (since a $\chi^2_1$ is just one standard normal squared). After recentering and rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\mb{L}\bhat = \mb{c}$, we have $W \indist \chi^2_{q}$. ::: {.callout-note} @@ -302,7 +302,7 @@ The Wald statistic is not a common test provided by standard statistical softwar $$ F = \frac{W}{q}, $$ -which also typically uses the the homoskedastic variance estimator $\mb{V}^{\texttt{lm}}_{\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. 
When these assumptions do not hold, the $F$ distribution is not really statistically justified, it is slightly more conservative than the $\chi^2_q$ distribution, and the inference will converge as $n\to\infty$. So it might be justified as an *ad hoc* small sample adjustment to the Wald test. For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and say we have a sample size of $n = 100$. In that case, the critical value for the F test with $\alpha = 0.05$ is +which also typically uses the homoskedastic variance estimator $\mb{V}^{\texttt{lm}}_{\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution is not really statistically justified, it is slightly more conservative than the $\chi^2_q$ distribution, and the inference will converge as $n\to\infty$. So it might be justified as an *ad hoc* small sample adjustment to the Wald test. For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and say we have a sample size of $n = 100$. In that case, the critical value for the F test with $\alpha = 0.05$ is ```{r} qf(0.95, df1 = 2, df2 = 100 - 4) @@ -343,7 +343,7 @@ Under the linear regression model assumption, OLS is unbiased for the population $$ \E[\bhat \mid \Xmat] = \bfbeta, $$ -and its conditional sampling variance issue +and its conditional sampling variance is $$ \mb{\V}_{\bhat} = \V[\bhat \mid \Xmat] = \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2_i \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1}, $$ @@ -396,7 +396,7 @@ where $\overset{a}{\sim}$ means approximately asymptotically distributed as. Und $$ \mb{V}_{\bhat} = \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2_i \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \approx \mb{V}_{\bfbeta} / n $$ -In practice, these two derivations lead to basically the same variance estimator. Recall the heteroskedastic-consistent variance estimator is +In practice, these two derivations lead to basically the same variance estimator. Recall the heteroskedastic-consistent variance estimator $$ \widehat{\mb{V}}_{\bfbeta} = \left( \frac{1}{n} \Xmat'\Xmat \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^n\widehat{e}_i^2\X_i\X_i' \right) \left( \frac{1}{n} \Xmat'\Xmat \right)^{-1}, $$ @@ -437,7 +437,7 @@ is unbiased, $\E[\widehat{\mb{V}}^{\texttt{lm}}_{\bhat} \mid \Xmat] = \mb{V}^{\t ::: ::: {.proof} -Under homoskedasticity $\sigma^2_i = \sigma^2$ for all $i$. Recall that $\sum_{i=1}^n \X_i\X_i' = \Xmat'\Xmat$ Thus, the conditional sampling variance from @thm-ols-unbiased, +Under homoskedasticity $\sigma^2_i = \sigma^2$ for all $i$. Recall that $\sum_{i=1}^n \X_i\X_i' = \Xmat'\Xmat$. Thus, the conditional sampling variance from @thm-ols-unbiased, $$ \begin{aligned} \V[\bhat \mid \Xmat] &= \left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \sigma^2 \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \\ &= \sigma^2\left( \Xmat'\Xmat \right)^{-1}\left( \sum_{i=1}^n \X_i\X_i' \right) \left( \Xmat'\Xmat \right)^{-1} \\&= \sigma^2\left( \Xmat'\Xmat \right)^{-1}\left( \Xmat'\Xmat \right) \left( \Xmat'\Xmat \right)^{-1} \\&= \sigma^2\left( \Xmat'\Xmat \right)^{-1} = \mb{V}^{\texttt{lm}}_{\bhat}. 
@@ -456,12 +456,12 @@ where the first equality is because $\mb{M}_{\Xmat} = \mb{I}_{n} - \Xmat (\Xmat' $$ \V[\widehat{e}_i \mid \Xmat] = \E[\widehat{e}_{i}^{2} \mid \Xmat] = (1 - h_{ii})\sigma^{2}. $$ -In the last chapter, we established one property of these leverage values in @sec-leverage is that $\sum_{i=1}^n h_{ii} = k+ 1$, so $\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have +In the last chapter, we established one property of these leverage values in @sec-leverage, namely $\sum_{i=1}^n h_{ii} = k+ 1$, so $\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have $$ \begin{aligned} \E[\widehat{\sigma}^{2} \mid \Xmat] &= \frac{1}{n-k-1} \sum_{i=1}^{n} \E[\widehat{e}_{i}^{2} \mid \Xmat] \\ &= \frac{\sigma^{2}}{n-k-1} \sum_{i=1}^{n} 1 - h_{ii} \\ - &= \sigma^{2} + &= \sigma^{2}. \end{aligned} $$ This establishes $\E[\widehat{\mb{V}}^{\texttt{lm}}_{\bhat} \mid \Xmat] = \mb{V}^{\texttt{lm}}_{\bhat}$. diff --git a/_freeze/08_ols_properties/execute-results/html.json b/_freeze/08_ols_properties/execute-results/html.json index 52c3c36..2d75d34 100644 --- a/_freeze/08_ols_properties/execute-results/html.json +++ b/_freeze/08_ols_properties/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "2c4a143aed2c71fb6dce898bef9e774a", + "hash": "92c14d8fe0bbdcbbdcc4b09f07ed1ada", "result": { - "markdown": "# The statistics of least squares\n\nIn the last chapter, we derived the least squares estimator and investigated many of its mechanical properties. These properties are essential for the practical application of OLS. Still, we should also understand its statistical properties, such as the ones described in Part I: unbiasedness, sampling variance, consistency, and asymptotic normality. As we saw then, these properties fall into finite-sample (unbiasedness, sampling variance) and asymptotic (consistency, asymptotic normality). \n\nIn this chapter, we will focus first on the asymptotic properties of OLS because those properties hold under the relatively mild conditions of the linear projection model introduced in @sec-linear-projection. We will see that OLS consistently estimates a coherent quantity of interest (the best linear predictor) regardless of whether the conditional expectation is linear. That is, for the asymptotic properties of the estimator, we will not need the commonly invoked linearity assumption. Later, when we investigate the finite-sample properties, we will show how linearity will help us establish unbiasedness and how normality of the errors can allow us to conduct exact, finite-sample inference. But these assumptions are very strong, so it's vital to understand what we can say about OLS without making them. \n\n## Large-sample properties of OLS\n\nAs we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: a consistent estimate of the variance of $\\bhat$ and the approximate distribution of $\\bhat$ in large samples. Remember that since $\\bhat$ is a vector, then the variance of that estimator will actually be a variance-covariance matrix. To obtain these two ingredients, we will first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which will include its variance. \n\n\nWe begin by setting out the assumptions we will need for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\\bhat = \\E[\\X_{i}\\X_{i}']^{-1}\\E[\\X_{i}Y_{i}]$, is well-defined and unique. 
\n\n::: {.callout-note}\n\n### Linear projection assumptions\n\nThe linear projection model makes the following assumptions:\n\n1. $\\{(Y_{i}, \\X_{i})\\}_{i=1}^n$ are iid random vectors. \n\n2. $\\E[Y_{i}^{2}] < \\infty$ (finite outcome variance)\n\n3. $\\E[\\Vert \\X_{i}\\Vert^{2}] < \\infty$ (finite variances and covariances of covariates)\n\n2. $\\E[\\X_{i}\\X_{i}']$ is positive definite (no linear dependence in the covariates)\n:::\n\n\nRecall that these are mild conditions on the joint distribution of $(Y_{i}, \\X_{i})$ and in particular, we are **not** assuming linearity of the CEF, $\\E[Y_{i} \\mid \\X_{i}]$, nor are we assuming any specific distribution for the data. \n\nWe can helpfully decompose the OLS estimator into the actual BLP coefficient plus estimation error as\n$$ \n\\bhat = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_iY_i \\right) = \\bfbeta + \\underbrace{\\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\right)}_{\\text{estimation error}}.\n$$ \n \nThis decomposition will help us quickly establish the consistency of $\\bhat$. By the law of large numbers, we know that sample means will converge in probability to population expectations, so we have\n$$ \n\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\inprob \\E[\\X_i\\X_i'] \\equiv \\mb{Q}_{\\X\\X} \\qquad \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\inprob \\E[\\X_{i} e_{i}] = \\mb{0},\n$$\nwhich implies that \n$$\n\\bhat \\inprob \\beta + \\mb{Q}_{\\X\\X}^{-1}\\E[\\X_ie_i] = \\beta,\n$$\nby the continuous mapping theorem (the inverse is a continuous function). The linear projection assumptions ensure that LLN applies to these sample means and ensure that $\\E[\\X_{i}\\X_{i}']$ is invertible. \n\n\n::: {#thm-ols-consistency}\nUnder the above linear projection assumptions, the OLS estimator is consistent for the best linear projection coefficients, $\\bhat \\inprob \\bfbeta$.\n:::\n\nThus, OLS should be close to the population linear regression in large samples under relatively mild conditions. Remember that this might not equal the conditional expectation if the CEF is nonlinear. We can say here that OLS converges to the best *linear* approximation to the CEF. Of course, this also means that if the CEF is linear, then OLS will consistently estimate the coefficients of the CEF. \n\nTo emphasize here: the only assumption we made about the dependent variable is that it has finite variance and is iid. Under this assumption, the outcome could be continuous, categorical, binary, or event count. \n\n\nNext, we would like to establish an asymptotic normality result for the OLS coefficients. We first review some key ideas about the central limit theorem.\n\n::: {.callout-note}\n\n## CLT reminder\n\nSuppose that we have a function of the data iid random vectors $\\X_1, \\ldots, \\X_n$, $g(\\X_{i})$ where $\\E[g(\\X_{i})] = 0$ and so $\\V[g(\\X_{i})] = \\E[g(\\X_{i})g(\\X_{i})']$. 
Then if $\\E[\\Vert g(\\X_{i})\\Vert^{2}] < \\infty$, the CLT implies that\n$$ \n\\sqrt{n}\\left(\\frac{1}{n} \\sum_{i=1}^{n} g(\\X_{i}) - \\E[g(\\X_{i})]\\right) = \\frac{1}{\\sqrt{n}} \\sum_{i=1}^{n} g(\\X_{i}) \\indist \\N(0, \\E[g(\\X_{i})g(\\X_{i}')]) \n$$ {#eq-clt-mean-zero}\n:::\n\nWe now manipulate our decomposition to arrive at the *stabilized* version of the estimator,\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\right).\n$$\nWe have already established that the first term on the right-hand side will converge in probability to $\\mb{Q}_{\\X\\X}^{-1}$. Notice that $\\E[\\X_{i}e_{i}] = 0$, so we can apply @eq-clt-mean-zero to the second term. The covariance matrix of $\\X_ie_{i}$ is \n$$ \n\\mb{\\Omega} = \\V[\\X_{i}e_{i}] = \\E[\\X_{i}e_{i}(\\X_{i}e_{i})'] = \\E[e_{i}^{2}\\X_{i}\\X_{i}'].\n$$ \nThe CLT will imply that\n$$ \n\\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\indist \\N(0, \\mb{\\Omega}).\n$$\nCombining these facts with Slutsky's Theorem implies the following theorem. \n\n::: {#thm-ols-asymptotic-normality}\n\nSuppose that the linear projection assumptions hold and, in addition, we have $\\E[Y_{i}^{4}] < \\infty$ and $\\E[\\lVert\\X_{i}\\rVert^{4}] < \\infty$. Then the OLS estimator is asymptotically normal with\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) \\indist \\N(0, \\mb{V}_{\\bfbeta}),\n$$\nwhere\n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1}.\n$$\n\n:::\n\nThus, if the sample size is large enough, we can approximate the distribution of $\\bhat$ with a multivariate normal with mean $\\bfbeta$ and covariance matrix $\\mb{V}_{\\bfbeta}/n$. In particular, the square root of the $j$th diagonals of this matrix will be standard errors for $\\widehat{\\beta}_j$. Knowing the shape of the OLS estimator's multivariate distribution will allow us to conduct hypothesis tests and generate confidence intervals for both individual coefficients and groups of coefficients. But first, we need an estimate of the covariance matrix!\n\n\n\n## Variance estimation for OLS\n\nThe asymptotic normality of OLS from the last section is of limited value without some way to estimate the covariance matrix, \n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1}.\n$$\nSince each term here is a population mean, this is an ideal place to drop a plug-in estimator. In particular, let's use the following estimators:\n$$ \n\\begin{aligned}\n \\mb{Q}_{\\X\\X} &= \\E[\\X_{i}\\X_{i}'] & \\widehat{\\mb{Q}}_{\\X\\X} &= \\frac{1}{n} \\sum_{i=1}^{n} \\X_{i}\\X_{i}' = \\frac{1}{n}\\Xmat'\\Xmat \\\\\n \\mb{\\Omega} &= \\E[e_i^2\\X_i\\X_i'] & \\widehat{\\mb{\\Omega}} & = \\frac{1}{n}\\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i'.\n\\end{aligned}\n$$\nUnder the assumptions of @thm-ols-asymptotic-normality, the LLN will imply that these are consistent for their targets, $\\widehat{\\mb{Q}}_{\\X\\X} \\inprob \\mb{Q}_{\\X\\X}$ and $\\widehat{\\mb{\\Omega}} \\inprob \\mb{\\Omega}$. 
We can plug these into the variance formula to arrive at\n$$ \n\\begin{aligned}\n \\widehat{\\mb{V}}_{\\bfbeta} &= \\widehat{\\mb{Q}}_{\\X\\X}^{-1}\\widehat{\\mb{\\Omega}}\\widehat{\\mb{Q}}_{\\X\\X}^{-1} \\\\\n &= \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n\\end{aligned}\n$$\nwhich by the continuous mapping theorem is consistent, $\\widehat{\\mb{V}}_{\\bfbeta} \\inprob \\mb{V}_{\\bfbeta}$. \n\nThis estimator is sometimes called the **robust variance estimator** or, more accurately, the **heteroskedasticity-consistent (HC) variance estimator**. How is this robust? Consider the standard **homoskedasticity** assumption that most statistical software packages make when estimating OLS variances: the variance of the errors does not depend on the covariates: $\\V[e_{i}^{2} \\mid \\X_{i}] = \\V[e_{i}^{2}]$. This assumption is stronger than we need, and we can rely on a weaker assumption that the squared errors are uncorrelated with a specific function of the covariates: \n$$ \n\\E[e_{i}^{2}\\X_{i}\\X_{i}'] = \\E[e_{i}^{2}]\\E[\\X_{i}\\X_{i}'] = \\sigma^{2}\\mb{Q}_{\\X\\X}, \n$$\nwhere $\\sigma^2$ is the variance of the residuals (since $\\E[e_{i}] = 0$). Homoskedasticity simplifies the asymptotic variance of the stabilized estimator, $\\sqrt{n}(\\bhat - \\bfbeta)$, to\n$$ \n\\mb{V}^{\\texttt{lm}}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\sigma^{2}\\mb{Q}_{\\X\\X}\\mb{Q}_{\\X\\X}^{-1} = \\sigma^2\\mb{Q}_{\\X\\X}^{-1}.\n$$\nWe already have an estimator for $\\mb{Q}_{\\X\\X}$, but we need one for $\\sigma^2$. We can easily use the SSR,\n$$ \n\\widehat{\\sigma}^{2} = \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\widehat{e}_{i}^{2},\n$$\nwhere we use $n-k-1$ in the denominator instead of $n$ to correct for the residuals being slightly less variable than the actual errors (because OLS mechanically attempts to make the residuals small). For consistent variance estimation, $n-k -1$ or $n$ can be used, since either way $\\widehat{\\sigma}^2 \\inprob \\sigma^2$. Thus, under homoskedasticity, we have\n$$ \n\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}} = \\widehat{\\sigma}^{2}\\left(\\Xmat'\\Xmat\\right)^{{-1}},\n$$\nwhich is the standard variance estimator used by `lm()` in R or `reg` in Stata. \n\n\nNow that we have two estimators, $\\widehat{\\mb{V}}_{\\bfbeta}$ and $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$, how do they compare? Notice that the HC variance estimator and the homoskedasticity variance estimator will both be consistent when homoskedasticity holds. But as the \"heteroskedasticity-consistent\" label implies, only the HC variance estimator will be consistent when homoskedasticity fails to hold. So $\\widehat{\\mb{V}}_{\\bfbeta}$ has the advantage of being consistent regardless of this assumption. This advantage comes at a cost, however. When homoskedasticity is correct, $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$ incorporates that assumption into the estimator where the HC variance estimator has to estimate it. The HC estimator will have higher variance (the variance estimator will be more variable!) when homoskedasticity actually does hold. \n\n\n\n\n\nNow that we have established the asymptotic normality of the OLS estimator and developed a consistent estimator of its variance, we can proceed with all of the statistical inference tools we discussed in Part I of this guide. 
Define the estimated **heteroskedasticity-consistent standard errors** as\n$$ \n\\widehat{\\se}(\\widehat{\\beta}_{j}) = \\sqrt{\\frac{[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}}{n}},\n$$\nwhere $[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}$ is the $j$th diagonal entry of the HC variance estimator. Note that we divide by $\\sqrt{n}$ here because $\\widehat{\\mb{V}}_{\\bfbeta}$ is a consistent estimator of the stabilized estimator $\\sqrt{n}(\\bhat - \\bfbeta)$ not the estimator itself. \n\nHypothesis tests and confidence intervals for individual coefficients are almost precisely the same as with the general case presented in Part I. For a two-sided test of $H_0: \\beta_j = b$ versus $H_1: \\beta_j \\neq b$, we can build the t-statistic and conclude that, under the null,\n$$\n\\frac{\\widehat{\\beta}_j - b}{\\widehat{\\se}(\\widehat{\\beta}_{j})} \\indist \\N(0, 1).\n$$\nTypically, statistical software will helpfully provide the t-statistic for the null of no (partial) linear relationship between $X_{ij}$ and $Y_i$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nwhich measures how large the estimated coefficient is in standard errors. With $\\alpha = 0.05$, asymptotic normality would imply that we reject this null when $t > 1.96$. We can form asymptotically-valid confidence intervals with \n$$ \n\\left[\\widehat{\\beta}_{j} - z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j}),\\;\\widehat{\\beta}_{j} + z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j})\\right]. \n$$\nFor reasons we will discuss below, standard software typically relies on the $t$ distribution instead of the normal for hypothesis testing and confidence intervals. Still, this difference is of little consequence in large samples. \n\n## Inference for multiple parameters\n\nWith multiple coefficients, we might have hypotheses that involve more than one coefficient. As an example, let's focus on a regression with an interaction between two covariates, \n$$\nY_i = \\beta_0 + X_i\\beta_1 + Z_i\\beta_2 + X_iZ_i\\beta_3 + e_i.\n$$\nSuppose we wanted to test the hypothesis that $X_i$ does not affect the best linear predictor for $Y_i$. That would be\n$$ \nH_{0}: \\beta_{1} = 0 \\text{ and } \\beta_{3} = 0\\quad\\text{vs}\\quad H_{1}: \\beta_{1} \\neq 0 \\text{ or } \\beta_{3} \\neq 0,\n$$\nwhere we usually write the null more compactly as $H_0: \\beta_1 = \\beta_3 = 0$. \n\nTo test this null hypothesis, we need a test statistic that discriminates the two hypotheses: it should be large when the alternative is true and small when the null is true. With a single coefficient, we usually test the null hypothesis of $H_0: \\beta_j = b_0$ with the $t$-statistic, \n$$ \nt = \\frac{\\widehat{\\beta}_{j} - b_{0}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nand we usually take the absolute value, $|t|$, as our measure of how far our estimate is from the null. But notice that we could also use the square of the $t$ statistic, which is\n$$ \nt^{2} = \\frac{\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{\\V[\\widehat{\\beta}_{j}]} = \\frac{n\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{[\\mb{V}_{\\bfbeta}]_{[jj]}} \n$$ {#eq-squared-t}\n\nSo here's another way to differentiate the null from the alternative: the squared distance between them divided by the variance of the estimate. \n\nCan we generalize this idea to hypotheses about multiple parameters? Adding the sum of squared distances for each component of the null hypothesis is straightforward. 
For our interaction example, that would be\n$$ \n\\widehat{\\beta}_1^2 + \\widehat{\\beta}_3^2, \n$$\nbut remember that some of the estimated coefficients are noisier than others, so we should account for the uncertainty, just like we did for the $t$-statistic. \n\nWith multiple parameters and multiple coefficients, the variances will now require matrix algebra. We can write any hypothesis about linear functions of the coefficients as $H_{0}: \\mb{L}\\bfbeta = \\mb{c}$. For example, in the interaction case, we have\n$$ \n\\mb{L} =\n\\begin{pmatrix}\n 0 & 1 & 0 & 0 \\\\\n 0 & 0 & 0 & 1 \\\\\n\\end{pmatrix}\n\\qquad\n\\mb{c} =\n\\begin{pmatrix}\n 0 \\\\\n 0\n\\end{pmatrix}\n$$\nThus, $\\mb{L}\\bfbeta = \\mb{0}$ is equivalent to $\\beta_1 = 0$ and $\\beta_3 = 0$. Notice that with other $\\mb{L}$ matrices, we could represent more complicated hypotheses like $2\\beta_1 - \\beta_2 = 34$, though we mostly stick to simpler functions. Let $\\widehat{\\bs{\\theta}} = \\mb{L}\\bhat$ be the OLS estimate of the function of the coefficients. By the delta method (discussed in @sec-delta-method), we have\n$$ \n\\sqrt{n}\\left(\\mb{L}\\bhat - \\mb{L}\\bfbeta\\right) \\indist \\N(0, \\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L}).\n$$\nWe can now generalize the squared $t$ statistic in @eq-squared-t. In particular, we will take the distances $\\mb{L}\\bhat - \\mb{c}$ weighted by the variance-covariance matrix $\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L}$, \n$$ \nW = n(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L})^{-1}(\\mb{L}\\bhat - \\mb{c}),\n$$\nwhich is called the **Wald test statistic**. This statistic generalizes the ideas of the t-statistic to multiple parameters. With the t-statistic, we recenter to have mean 0 and divide by the standard error to get a variance of 1. If we ignore the middle variance weighting, we have $(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\bhat - \\mb{c})$ which is just the sum of the squared deviations of the estimates from the null. Including the $(\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L})^{-1}$ weight has the effect of rescaling the distribution of $\\mb{L}\\bhat - \\mb{c}$ to make it rotationally symmetric around 0 (so the resulting dimensions are uncorrelated) with each dimension having an equal variance of 1. In this way, the Wald statistic transforms the random vectors to be mean-centered and have variance 1 (just the t-statistic), but also to have the resulting random variables in the vector be uncorrelated.[^norms]\n\n\n[^norms]: The form of the Wald statistic is that of a weighted inner product, $\\mb{x}'\\mb{Ay}$, where $\\mb{A}$ is a symmetric positive-definite weighting matrix. \n\nWhy transform the data in this way? @fig-wald shows the contour plot of a hypothetical joint distribution of two coefficients from an OLS regression. We might want to know how far different points in the distribution are from the mean, which in this case is $(1, 2)$. Without considering the joint distribution, the circle is obviously closer to the mean than the triangle. However, looking at where the two points are on the distribution, the circle is at a lower contour than the triangle, meaning it is more extreme than the triangle for this particular distribution. The Wald statistic, then, takes into consideration how much of a \"climb\" it is for $\\mb{L}\\bhat$ to get to $\\mb{c}$ given the distribution of $\\mb{L}\\bhat$.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of two slope coefficients. 
The circle is closer to the center of the distribution by the standard Euclidean distance, but the triangle is closer once you consider the joint distribution.](08_ols_properties_files/figure-html/fig-wald-1.png){#fig-wald width=672}\n:::\n:::\n\n\n\nIf $\\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$ statistic, $W = t^2$. This fact will help us think about the asymptotic distribution of $W$. Notice that as $n\\to\\infty$, we know that by the asymptotic normality of $\\bhat$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{\\widehat{\\se}[\\widehat{\\beta}_{j}]} \\indist \\N(0,1)\n$$\nso $t^2$ will converge in distribution to a $\\chi^2_1$ (since a $\\chi^2_1$ is just one standard normal squared). After recentering ad rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\\mb{L}\\bhat = \\mb{c}$, we have $W \\indist \\chi^2_{q}$. \n\n\n::: {.callout-note}\n\n## Chi-squared critical values\n\nWe can obtain critical values for the $\\chi^2_q$ distribution using the `qchisq()` function in R. For example, if we wanted to obtain the critical value $w$ that such that $\\P(W > w_{\\alpha}) = \\alpha$ for our two-parameter interaction example, we could use:\n\n::: {.cell}\n\n```{.r .cell-code}\nqchisq(p = 0.95, df = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5.991465\n```\n:::\n:::\n\n:::\n\n\nWe need to define the rejection region to use the Wald statistic in a test. Because we are squaring each distance in $W \\geq 0$, larger values of $W$ indicate more disagreement with the null in either direction. Thus, for an $\\alpha$-level test of the joint null, we only need a one-sided rejection region of the form $\\P(W > w_{\\alpha}) = \\alpha$. Obtaining these values is straightforward (see the above callout tip). For $q = 2$ and a $\\alpha = 0.05$, the critical value is roughly 6. \n\n\nThe Wald statistic is not a common test provided by standard statistical software functions like `lm()` in R, though it is fairly straightforward to implement \"by hand.\" Alternatively, packages like [`{aod}`](https://cran.r-project.org/web/packages/aod/index.html) or [`{clubSandwich}`](http://jepusto.github.io/clubSandwich/) have implementations of the test. What is reported by most software implementations of OLS (like `lm()` in R) is the F-statistic, which is\n$$ \nF = \\frac{W}{q},\n$$\nwhich also typically uses the the homoskedastic variance estimator $\\mb{V}^{\\texttt{lm}}_{\\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution is not really statistically justified, it is slightly more conservative than the $\\chi^2_q$ distribution, and the inference will converge as $n\\to\\infty$. So it might be justified as an *ad hoc* small sample adjustment to the Wald test. For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and say we have a sample size of $n = 100$. 
In that case, the critical value for the F test with $\\alpha = 0.05$ is\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqf(0.95, df1 = 2, df2 = 100 - 4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.091191\n```\n:::\n:::\n\n\nThis result implies a critical value of 6.182 on the scale of the Wald statistic (multiplying it by $q = 2$). Compared to the earlier critical value of 5.991 based on the $\\chi^2_2$ distribution, we can see that the inferences will be very similar even in moderately-sized datasets. \n\nFinally, note that the F-statistic reported by `lm()` in R is the test of all the coefficients except the intercept being 0. In modern quantitative social sciences, this test is seldom substantively interesting. \n\n\n## Finite-sample properties with a linear CEF\n\nAll the above results have been large-sample properties, and we have not addressed finite-sample properties like the sampling variance or unbiasedness. Under the linear projection assumption above, OLS is generally biased without stronger assumptions. This section introduces the stronger assumption that will allow us to establish stronger properties for OLS. As usual, however, remember that these stronger assumptions can be wrong. \n\n\n::: {.callout-note}\n## Assumption: Linear Regression Model\n1. The variables $(Y_{i}, \\X_{i})$ satisfy the linear CEF assumption.\n$$ \n\\begin{aligned}\n Y_{i} &= \\X_{i}'\\bfbeta + e_{i} \\\\\n \\E[e_{i}\\mid \\X_{i}] & = 0.\n\\end{aligned}\n$$\n\n2. The design matrix is invertible $\\E[\\X_{i}\\X_{i}'] > 0$ (positive definite).\n:::\n\n\nWe discussed the concept of a linear CEF extensively in @sec-regression. However, recall that the CEF might be linear mechanically if the model is **saturated** or when there are as many coefficients in the model as there are unique values of $\\X_i$. When a model is not saturated, the linear CEF assumption is just that: an assumption. What can this assumption do? It can actually establish quite a few nice statistical properties in finite samples. \n\nOne note before we proceed. When focusing on the finite sample inference for OLS, it is customary to focus on its properties **conditional on the observed covariates**, such as $\\E[\\bhat \\mid \\Xmat]$ or $\\V[\\bhat \\mid \\Xmat]$. The historical reason for this was that the researcher often chose these independent variables, so they were not random. Thus, you'll sometimes see $\\Xmat$ treated as \"fixed\" in some older texts, and they might even omit explicit conditioning statements. \n\n\n::: {#thm-ols-unbiased}\n\nUnder the linear regression model assumption, OLS is unbiased for the population regression coefficients, \n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta,\n$$\nand its conditional sampling variance issue\n$$\n\\mb{\\V}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nwhere $\\sigma^2_{i} = \\E[e_{i}^{2} \\mid \\Xmat]$. \n:::\n\n\n::: {.proof}\n\nTo prove the conditional unbiasedness, recall that we can write the OLS estimator as\n$$\n\\bhat = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e},\n$$\nand so taking (conditional) expectations, we have,\n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta + \\E[(\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\E[\\mb{e} \\mid \\Xmat] = \\bfbeta,\n$$\nbecause under the linear CEF assumption $\\E[\\mb{e}\\mid \\Xmat] = 0$. 
\n\nFor the conditional sampling variance, we can use the same decomposition we have,\n$$\n\\V[\\bhat \\mid \\Xmat] = \\V[\\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = (\\Xmat'\\Xmat)^{-1}\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat(\\Xmat'\\Xmat)^{-1}. \n$$\nSince $\\E[\\mb{e}\\mid \\Xmat] = 0$, we know that $\\V[\\mb{e}\\mid \\Xmat] = \\E[\\mb{ee}' \\mid \\Xmat]$, which is a matrix with diagonal entries $\\E[e_{i}^{2} \\mid \\Xmat] = \\sigma^2_i$ and off-diagonal entries $\\E[e_{i}e_{j} \\Xmat] = \\E[e_{i}\\mid \\Xmat]\\E[e_{j}\\mid\\Xmat] = 0$, where the first equality follows from the independence of the errors across units. Thus, $\\V[\\mb{e} \\mid \\Xmat]$ is a diagonal matrix with $\\sigma^2_i$ along the diagonal, which means\n$$\n\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat = \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i',\n$$\nestablishing the conditional sampling variance.\n \n:::\n\nThus, for any realization of the covariates, $\\Xmat$, OLS is unbiased for the true regression coefficients $\\bfbeta$. By the law of iterated expectation, we also know that it is unconditionally unbiased[^unconditional] as well since\n$$\n\\E[\\bhat] = \\E[\\E[\\bhat \\mid \\Xmat]] = \\bfbeta. \n$$\nThe difference between these two statements usually isn't incredibly meaningful. \n\n[^unconditional]: We are basically ignoring some edge cases when it comes to discrete covariates here. In particular, we assume that $\\Xmat'\\Xmat$ is nonsingular with probability one. However, this can fail if we have a binary covariate since there is some chance (however slight) that the entire column will be all ones or all zeros, which would lead to a singular matrix $\\Xmat'\\Xmat$. Practically this is not a big deal, but it does mean that we have to ignore this issue theoretically or focus on conditional unbiasedness. \n\n\nThere are a lot of variances flying around, so it's helpful to review them. Above, we derived the asymptotic variance of $\\mb{Z}_{n} = \\sqrt{n}(\\bhat - \\bfbeta)$, \n$$\n\\mb{V}_{\\bfbeta} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1},\n$$\nwhich implies that the approximate variance of $\\bhat$ will be $\\mb{V}_{\\bfbeta} / n$ because\n$$\n\\bhat = \\frac{Z_n}{\\sqrt{n}} + \\bfbeta \\quad\\implies\\quad \\bhat \\overset{a}{\\sim} \\N(\\bfbeta, n^{-1}\\mb{V}_{\\bfbeta}),\n$$\nwhere $\\overset{a}{\\sim}$ means approximately asymptotically distributed as. Under the linear CEF, the conditional sampling variance of $\\bhat$ has a similar form and will be similar to the \n$$\n\\mb{V}_{\\bhat} = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\approx \\mb{V}_{\\bfbeta} / n\n$$\nIn practice, these two derivations lead to basically the same variance estimator. Recall the heteroskedastic-consistent variance estimator is\n$$\n\\widehat{\\mb{V}}_{\\bfbeta} = \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n$$\nis a valid plug-in estimator for the asymptotic variance and\n$$\n\\widehat{\\mb{V}}_{\\bhat} = n^{-1}\\widehat{\\mb{V}}_{\\bfbeta}.\n$$\nThus, in practice, the asymptotic and finite-sample results under a linear CEF justify the same variance estimator. \n\n\n### Linear CEF model under homoskedasticity\n\nIf we are willing to make a homoskedasticity assumption on the errors, we can derive even stronger results for OLS. 
Stronger assumptions typically lead to stronger conclusions, but those conclusions may not be robust to assumption violations. But homoskedasticity is such a historically important assumption that statistical software implementations of OLS like `lm()` in R assume it. \n\n::: {.callout-note}\n\n## Assumption: Homoskedasticity with a linear CEF\n\nIn addition to the linear CEF assumption, we further assume that\n$$\n\\E[e_i^2 \\mid \\X_i] = \\E[e_i^2] = \\sigma^2,\n$$\nor that variance of the errors does not depend on the covariates. \n:::\n\n\n::: {#thm-homoskedasticity}\n\nUnder a linear CEF model with homoskedastic errors, the conditional sampling variance is\n$$\n\\mb{V}^{\\texttt{lm}}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\sigma^2 \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nand the variance estimator \n$$\n\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} = \\widehat{\\sigma}^2 \\left( \\Xmat'\\Xmat \\right)^{-1} \\quad\\text{where,}\\quad \\widehat{\\sigma}^2 = \\frac{1}{n - k - 1} \\sum_{i=1}^n \\widehat{e}_i^2\n$$\nis unbiased, $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. \n::: \n\n::: {.proof}\nUnder homoskedasticity $\\sigma^2_i = \\sigma^2$ for all $i$. Recall that $\\sum_{i=1}^n \\X_i\\X_i' = \\Xmat'\\Xmat$ Thus, the conditional sampling variance from @thm-ols-unbiased, \n$$ \n\\begin{aligned}\n\\V[\\bhat \\mid \\Xmat] &= \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2 \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\ &= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\Xmat'\\Xmat \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1} = \\mb{V}^{\\texttt{lm}}_{\\bhat}.\n\\end{aligned}\n$$\n\nFor unbiasedness, we just need to show that $\\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] = \\sigma^2$. Recall that we defined $\\mb{M}_{\\Xmat}$ as the residual-maker because $\\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}}$. We can use this to connect the residuals to the errors,\n$$ \n\\mb{M}_{\\Xmat}\\mb{e} = \\mb{M}_{\\Xmat}\\mb{Y} - \\mb{M}_{\\Xmat}\\Xmat\\bfbeta = \\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}},\n$$ \nso \n$$\n\\V[\\widehat{\\mb{e}} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\V[\\mb{e} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\sigma^2,\n$$\nwhere the first equality is because $\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'$ is constant conditional on $\\Xmat$. Notice that the diagonal entries of this matrix are the variances of particular residuals $\\widehat{e}_i$ and that the diagonal entries of the annihilator matrix are $1 - h_{ii}$ (since the $h_{ii}$ are the diagonal entries of $\\mb{P}_{\\Xmat}$). Thus, we have\n$$ \n\\V[\\widehat{e}_i \\mid \\Xmat] = \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] = (1 - h_{ii})\\sigma^{2}.\n$$\nIn the last chapter, we established one property of these leverage values in @sec-leverage is that $\\sum_{i=1}^n h_{ii} = k+ 1$, so $\\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have\n$$ \n\\begin{aligned}\n \\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] &= \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] \\\\\n &= \\frac{\\sigma^{2}}{n-k-1} \\sum_{i=1}^{n} 1 - h_{ii} \\\\\n &= \\sigma^{2}\n\\end{aligned}\n$$\nThis establishes $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. 
\n\n:::\n\n\nThus, under the linear CEF model and homoskedasticity of the errors, we have an unbiased variance estimator that is a simple function of the sum of squared residuals and the design matrix. Most statistical software packages estimate standard errors using $\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}$. \n\n\nThe final result we can derive for the linear CEF under homoskedasticity is an optimality result. We might ask ourselves if there is another estimator for $\\bfbeta$ that would outperform OLS in the sense of having a lower sampling variance. Perhaps surprisingly, no linear estimator for $\\bfbeta$ has a lower conditional variance, meaning that OLS is the **best linear unbiased estimator**, often jovially shortened to BLUE. This result is famously known as the Gauss-Markov Theorem.\n\n::: {#thm-gauss-markov}\n\nLet $\\widetilde{\\bfbeta} = \\mb{AY}$ be a linear and unbiased estimator for $\\bfbeta$. Under the linear CEF model with homoskedastic errors, \n$$\n\\V[\\widetilde{\\bfbeta}\\mid \\Xmat] \\geq \\V[\\bhat \\mid \\Xmat]. \n$$\n\n:::\n\n::: {.proof}\nNote that if $\\widetilde{\\bfbeta}$ is unbiased then $\\E[\\widetilde{\\bfbeta} \\mid \\Xmat] = \\bfbeta$ and so \n$$\n\\bfbeta = \\E[\\mb{AY} \\mid \\Xmat] = \\mb{A}\\E[\\mb{Y} \\mid \\Xmat] = \\mb{A}\\Xmat\\bfbeta,\n$$\nwhich implies that $\\mb{A}\\Xmat = \\mb{I}_n$. \nRewrite the competitor as $\\widetilde{\\bfbeta} = \\bhat + \\mb{BY}$ where,\n$$ \n\\mb{B} = \\mb{A} - \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'.\n$$\nand note that $\\mb{A}\\Xmat = \\mb{I}_n$ implies that $\\mb{B}\\Xmat = 0$. We now have\n$$ \n\\begin{aligned}\n \\widetilde{\\bfbeta} &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{Y} \\\\\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\mb{B}\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e}\n\\end{aligned}\n$$\nThe variance of the competitor is, thus, \n$$ \n\\begin{aligned}\n \\V[\\widetilde{\\bfbeta} \\mid \\Xmat]\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\V[\\mb{e}\\mid \\Xmat]\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)' \\\\\n &= \\sigma^{2}\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\left( \\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{B}'\\right) \\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\mb{B}' + \\mb{B}\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &\\geq \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1} \\\\\n &= \\V[\\bhat \\mid \\Xmat]\n\\end{aligned}\n$$\nThe first equality comes from the properties of covariance matrices; the second is due to homoskedasticity; the fourth is due to $\\mb{B}\\Xmat = 0$, which implies that $\\Xmat'\\mb{B}' = 0$ as well. The fifth inequality holds because matrix products of the form $\\mb{BB}'$ are positive definite if $\\mb{B}$ is of full rank (which we have assumed it is). 
\n\n:::\n\nIn this proof, we saw that the variance of the competing estimator had variance $\\sigma^2\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)$ which we argued was \"greater than 0\" in the matrix sense, which is also called positive definite. What does this mean practically? Remember that any positive definite matrix must have strictly positive diagonal entries and that the diagonal entries of $\\V[\\bhat \\mid \\Xmat]$ and $V[\\widetilde{\\bfbeta}\\mid \\Xmat]$ are the variances of the individual parameters, $\\V[\\widehat{\\beta}_{j} \\mid \\Xmat]$ and $\\V[\\widetilde{\\beta}_{j} \\mid \\Xmat]$. Thus, the variances of the individual parameters will be larger for $\\widetilde{\\bfbeta}$ than for $\\bhat$.\n\nMany textbooks cite the Gauss-Markov theorem as a critical advantage of OLS over other methods, but it's essential to recognize its limitations. It requires linearity and homoskedastic error assumptions, which can be false in many applications. \n\nFinally, note that while we have shown this result for linear estimators, @Hansen22 proves a more general version of this result that applies to any unbiased estimator. \n\n## The normal linear model\n\nFinally, we add the strongest and thus least loved of the classical linear regression assumption: (conditional) normality of the errors. The historical reason to use this assumption was that finite-sample inference hits a roadblock without some knowledge of the sampling distribution of $\\bhat$. Under the linear CEF model, we saw that it was unbiased, and under homoskedasticity, we could produce an unbiased estimator of the conditional variance. But to do hypothesis testing or generate confidence intervals, we need to be able to make probability statements about the estimator, and for that, we need to know its exact distribution. When the sample size is large, we can rely on the CLT and know it is approximately normal. But in small samples, what do we do? Historically, we decided to assume (conditional) normality of the errors to proceed with some knowledge that we were wrong but hopefully not too wrong. \n\n\n::: {.callout-note}\n\n## The normal linear regression model\n\nIn addition to the linear CEF assumption, we assume that \n$$\ne_i \\mid \\Xmat \\sim \\N(0, \\sigma^2).\n$$\n\n:::\n\nA couple of things to point out: \n\n- The assumption here is not that $(Y_{i}, \\X_{i})$ are jointly normal (though this would be sufficient for the assumption to hold), but rather that $Y_i$ is normally distributed conditional on $\\X_i$. \n- Notice that the normal regression model has the homoskedasticity assumption baked in. \n\n::: {#thm-normal-ols}\n\nUnder the normal linear regression model, we have\n$$ \n\\begin{aligned}\n \\bhat \\mid \\Xmat &\\sim \\N\\left(\\bfbeta, \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1}\\right) \\\\\n \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}]_{jj}/\\sqrt{n}} &\\sim t_{n-k-1} \\\\\n W/q &\\sim F_{q, n-k-1}. \n\\end{aligned}\n$$\n\n:::\n\n\nThis theorem says that in the normal linear regression model, the coefficients follow a normal distribution, the t-statistics follow a $t$-distribution, and a transformation of the Wald statistic follows an $F$ distribution. These are **exact** results and do not rely on large-sample approximations. Under the assumption of conditional normality of the errors, they are as valid for $n = 5$ as for $n = 500,000$. \n\nFew people believe errors follow a normal distribution, so why even present these results? 
Unfortunately, most statistical software implementations of OLS implicitly assume this when calculating p-values for tests or constructing confidence intervals. That is, the p-value associated with the $t$-statistic that `lm()` outputs in R relies on the $t_{n-k-1}$ distribution, and the critical values used to construct confidence intervals with `confint()` use that distribution as well. When normality does not hold, there is no principled reason to use the $t$ or the $F$ distributions in this way. But we might hold our nose and use this *ad hoc* procedure under two rationalizations:\n\n- $\\bhat$ is asymptotically normal, but this approximation might be poor in smaller finite samples. The $t$ distribution will make inference more conservative in these cases (wider confidence intervals, smaller test rejection regions), which might help offset the poor approximation of the normal in small samples. \n- As $n\\to\\infty$, the $t_{n-k-1}$ will converge to a standard normal distribution, so the *ad hoc* adjustment will not matter much for medium to large samples. \n\nThese arguments are not very convincing since it's unclear whether the $t$ approximation will be any better than the normal in finite samples. But it's the best we can do to console ourselves as we find more data. \n", + "markdown": "# The statistics of least squares\n\nIn the last chapter, we derived the least squares estimator and investigated many of its mechanical properties. These properties are essential for the practical application of OLS. Still, we should also understand its statistical properties, such as the ones described in Part I: unbiasedness, sampling variance, consistency, and asymptotic normality. As we saw then, these properties fall into finite-sample (unbiasedness, sampling variance) and asymptotic (consistency, asymptotic normality). \n\nIn this chapter, we will focus first on the asymptotic properties of OLS because those properties hold under the relatively mild conditions of the linear projection model introduced in @sec-linear-projection. We will see that OLS consistently estimates a coherent quantity of interest (the best linear predictor) regardless of whether the conditional expectation is linear. That is, for the asymptotic properties of the estimator, we will not need the commonly invoked linearity assumption. Later, when we investigate the finite-sample properties, we will show how linearity will help us establish unbiasedness and how normality of the errors can allow us to conduct exact, finite-sample inference. But these assumptions are very strong, so it's vital to understand what we can say about OLS without making them. \n\n## Large-sample properties of OLS\n\nAs we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: a consistent estimate of the variance of $\\bhat$ and the approximate distribution of $\\bhat$ in large samples. Remember that since $\\bhat$ is a vector, the variance of that estimator will actually be a variance-covariance matrix. To obtain these two ingredients, we will first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which will include its variance. \n\n\nWe begin by setting out the assumptions we will need for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\\bhat = \\E[\\X_{i}\\X_{i}']^{-1}\\E[\\X_{i}Y_{i}]$, is well-defined and unique. 
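Before stating those assumptions, a brief computational aside may help fix ideas. The sample analogue of this best linear predictor formula is exactly what OLS computes, and the minimal R sketch below (the simulated data and all object names are illustrative assumptions, not part of the main text) checks that solving the plug-in system reproduces the coefficients reported by `lm()`.

```r
set.seed(1234)
n <- 1000
x1 <- rnorm(n)
x2 <- rbinom(n, size = 1, prob = 0.5)
y <- 1 + 2 * x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)                                # design matrix with an intercept
bhat_plugin <- solve(crossprod(X), crossprod(X, y))  # solves (X'X) b = X'y
bhat_lm <- coef(lm(y ~ x1 + x2))

cbind(plugin = drop(bhat_plugin), lm = bhat_lm)      # the two columns should match
```

With that computation in mind, we turn to the assumptions that govern its large-sample behavior.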
\n\n::: {.callout-note}\n\n### Linear projection assumptions\n\nThe linear projection model makes the following assumptions:\n\n1. $\\{(Y_{i}, \\X_{i})\\}_{i=1}^n$ are iid random vectors\n\n2. $\\E[Y^{2}_{i}] < \\infty$ (finite outcome variance)\n\n3. $\\E[\\Vert \\X_{i}\\Vert^{2}] < \\infty$ (finite variances and covariances of covariates)\n\n4. $\\E[\\X_{i}\\X_{i}']$ is positive definite (no linear dependence in the covariates)\n:::\n\n\nRecall that these are mild conditions on the joint distribution of $(Y_{i}, \\X_{i})$ and in particular, we are **not** assuming linearity of the CEF, $\\E[Y_{i} \\mid \\X_{i}]$, nor are we assuming any specific distribution for the data. \n\nWe can helpfully decompose the OLS estimator into the actual BLP coefficient plus estimation error as\n$$ \n\\bhat = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_iY_i \\right) = \\bfbeta + \\underbrace{\\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\right)}_{\\text{estimation error}}.\n$$ \n \nThis decomposition will help us quickly establish the consistency of $\\bhat$. By the law of large numbers, we know that sample means will converge in probability to population expectations, so we have\n$$ \n\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\inprob \\E[\\X_i\\X_i'] \\equiv \\mb{Q}_{\\X\\X} \\qquad \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\inprob \\E[\\X_{i} e_{i}] = \\mb{0},\n$$\nwhich implies that \n$$\n\\bhat \\inprob \\bfbeta + \\mb{Q}_{\\X\\X}^{-1}\\E[\\X_ie_i] = \\bfbeta,\n$$\nby the continuous mapping theorem (the inverse is a continuous function). The linear projection assumptions ensure that the LLN applies to these sample means and that $\\E[\\X_{i}\\X_{i}']$ is invertible. \n\n\n::: {#thm-ols-consistency}\nUnder the above linear projection assumptions, the OLS estimator is consistent for the best linear projection coefficients, $\\bhat \\inprob \\bfbeta$.\n:::\n\nThus, OLS should be close to the population linear regression in large samples under relatively mild conditions. Remember that this might not equal the conditional expectation if the CEF is nonlinear. We can say here that OLS converges to the best *linear* approximation to the CEF. Of course, this also means that if the CEF is linear, then OLS will consistently estimate the coefficients of the CEF. \n\nTo emphasize here: the only assumption we made about the dependent variable is that it has finite variance and is iid. Under this assumption, the outcome could be continuous, categorical, binary, or an event count. \n\n\nNext, we would like to establish an asymptotic normality result for the OLS coefficients. We first review some key ideas about the central limit theorem.\n\n::: {.callout-note}\n\n## CLT reminder\n\nSuppose that we have a function $g(\\cdot)$ of the iid random vectors $\\X_1, \\ldots, \\X_n$, where $\\E[g(\\X_{i})] = 0$ and so $\\V[g(\\X_{i})] = \\E[g(\\X_{i})g(\\X_{i})']$. 
Then if $\\E[\\Vert g(\\X_{i})\\Vert^{2}] < \\infty$, the CLT implies that\n$$ \n\\sqrt{n}\\left(\\frac{1}{n} \\sum_{i=1}^{n} g(\\X_{i}) - \\E[g(\\X_{i})]\\right) = \\frac{1}{\\sqrt{n}} \\sum_{i=1}^{n} g(\\X_{i}) \\indist \\N(0, \\E[g(\\X_{i})g(\\X_{i})']) \n$$ {#eq-clt-mean-zero}\n:::\n\nWe now manipulate our decomposition to arrive at the *stabilized* version of the estimator,\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\right).\n$$\nWe have already established that the first term on the right-hand side will converge in probability to $\\mb{Q}_{\\X\\X}^{-1}$. Notice that $\\E[\\X_{i}e_{i}] = 0$, so we can apply @eq-clt-mean-zero to the second term. The covariance matrix of $\\X_ie_{i}$ is \n$$ \n\\mb{\\Omega} = \\V[\\X_{i}e_{i}] = \\E[\\X_{i}e_{i}(\\X_{i}e_{i})'] = \\E[e_{i}^{2}\\X_{i}\\X_{i}'].\n$$ \nThe CLT will imply that\n$$ \n\\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\indist \\N(0, \\mb{\\Omega}).\n$$\nCombining these facts with Slutsky's Theorem implies the following theorem. \n\n::: {#thm-ols-asymptotic-normality}\n\nSuppose that the linear projection assumptions hold and, in addition, we have $\\E[Y_{i}^{4}] < \\infty$ and $\\E[\\lVert\\X_{i}\\rVert^{4}] < \\infty$. Then the OLS estimator is asymptotically normal with\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) \\indist \\N(0, \\mb{V}_{\\bfbeta}),\n$$\nwhere\n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1}.\n$$\n\n:::\n\nThus, if the sample size is large enough, we can approximate the distribution of $\\bhat$ with a multivariate normal with mean $\\bfbeta$ and covariance matrix $\\mb{V}_{\\bfbeta}/n$. In particular, the square root of the $j$th diagonal entry of this matrix will be the standard error for $\\widehat{\\beta}_j$. Knowing the shape of the OLS estimator's multivariate distribution will allow us to conduct hypothesis tests and generate confidence intervals for both individual coefficients and groups of coefficients. But first, we need an estimate of the covariance matrix!\n\n\n\n## Variance estimation for OLS\n\nThe asymptotic normality of OLS from the last section is of limited value without some way to estimate the covariance matrix, \n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1}.\n$$\nSince each term here is a population mean, this is an ideal place to drop in a plug-in estimator. In particular, let's use the following estimators:\n$$ \n\\begin{aligned}\n \\mb{Q}_{\\X\\X} &= \\E[\\X_{i}\\X_{i}'] & \\widehat{\\mb{Q}}_{\\X\\X} &= \\frac{1}{n} \\sum_{i=1}^{n} \\X_{i}\\X_{i}' = \\frac{1}{n}\\Xmat'\\Xmat \\\\\n \\mb{\\Omega} &= \\E[e_i^2\\X_i\\X_i'] & \\widehat{\\mb{\\Omega}} & = \\frac{1}{n}\\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i'.\n\\end{aligned}\n$$\nUnder the assumptions of @thm-ols-asymptotic-normality, the LLN will imply that these are consistent for their targets, $\\widehat{\\mb{Q}}_{\\X\\X} \\inprob \\mb{Q}_{\\X\\X}$ and $\\widehat{\\mb{\\Omega}} \\inprob \\mb{\\Omega}$. 
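To build some intuition for @thm-ols-asymptotic-normality before assembling the variance estimator, here is a short simulation sketch (the data-generating process, sample sizes, and object names are illustrative assumptions, not part of the main text). It repeatedly draws samples with heteroskedastic errors, computes the stabilized slope estimate in each, and compares its simulated variance to the corresponding entry of the sandwich matrix approximated by a large Monte Carlo draw.

```r
set.seed(60637)
n <- 500
n_sims <- 2000
beta <- c(1, 2)  # intercept and slope of the best linear predictor

z <- replicate(n_sims, {
  x <- runif(n, -1, 1)
  e <- rnorm(n, sd = sqrt(1 + x^2))  # heteroskedastic errors with E[e | x] = 0
  y <- beta[1] + beta[2] * x + e
  sqrt(n) * (coef(lm(y ~ x))[2] - beta[2])
})

## Approximate the population sandwich Q^{-1} Omega Q^{-1} for this design,
## using the fact that E[e^2 | x] = 1 + x^2 in this simulation
x_big <- runif(1e6, -1, 1)
X_big <- cbind(1, x_big)
Q <- crossprod(X_big) / 1e6
Omega <- crossprod(X_big * sqrt(1 + x_big^2)) / 1e6
V_beta <- solve(Q) %*% Omega %*% solve(Q)

c(simulated_var = var(z), sandwich_var = V_beta[2, 2])
## hist(z, breaks = 40) would look approximately normal
```

The two variance figures should be close, and a histogram of `z` approximately bell-shaped. With that picture in mind, we return to the plug-in pieces just defined.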
We can plug these into the variance formula to arrive at\n$$ \n\\begin{aligned}\n \\widehat{\\mb{V}}_{\\bfbeta} &= \\widehat{\\mb{Q}}_{\\X\\X}^{-1}\\widehat{\\mb{\\Omega}}\\widehat{\\mb{Q}}_{\\X\\X}^{-1} \\\\\n &= \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n\\end{aligned}\n$$\nwhich by the continuous mapping theorem is consistent, $\\widehat{\\mb{V}}_{\\bfbeta} \\inprob \\mb{V}_{\\bfbeta}$. \n\nThis estimator is sometimes called the **robust variance estimator** or, more accurately, the **heteroskedasticity-consistent (HC) variance estimator**. How is this robust? Consider the standard **homoskedasticity** assumption that most statistical software packages make when estimating OLS variances: that the variance of the errors does not depend on the covariates, $\\V[e_{i} \\mid \\X_{i}] = \\V[e_{i}]$. This assumption is stronger than we need, and we can rely on a weaker assumption that the squared errors are uncorrelated with a specific function of the covariates: \n$$ \n\\E[e_{i}^{2}\\X_{i}\\X_{i}'] = \\E[e_{i}^{2}]\\E[\\X_{i}\\X_{i}'] = \\sigma^{2}\\mb{Q}_{\\X\\X}, \n$$\nwhere $\\sigma^2$ is the variance of the errors (since $\\E[e_{i}] = 0$). Homoskedasticity simplifies the asymptotic variance of the stabilized estimator, $\\sqrt{n}(\\bhat - \\bfbeta)$, to\n$$ \n\\mb{V}^{\\texttt{lm}}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\sigma^{2}\\mb{Q}_{\\X\\X}\\mb{Q}_{\\X\\X}^{-1} = \\sigma^2\\mb{Q}_{\\X\\X}^{-1}.\n$$\nWe already have an estimator for $\\mb{Q}_{\\X\\X}$, but we need one for $\\sigma^2$. We can easily use the SSR,\n$$ \n\\widehat{\\sigma}^{2} = \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\widehat{e}_{i}^{2},\n$$\nwhere we use $n-k-1$ in the denominator instead of $n$ to correct for the residuals being slightly less variable than the actual errors (because OLS mechanically attempts to make the residuals small). For consistent variance estimation, $n-k-1$ or $n$ can be used, since either way $\\widehat{\\sigma}^2 \\inprob \\sigma^2$. Thus, under homoskedasticity, we have\n$$ \n\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}} = \\widehat{\\sigma}^{2}\\left(\\Xmat'\\Xmat\\right)^{{-1}},\n$$\nwhich is the standard variance estimator used by `lm()` in R or `reg` in Stata. \n\n\nNow that we have two estimators, $\\widehat{\\mb{V}}_{\\bfbeta}$ and $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$, how do they compare? Notice that the HC variance estimator and the homoskedasticity variance estimator will both be consistent when homoskedasticity holds. But as the \"heteroskedasticity-consistent\" label implies, only the HC variance estimator will be consistent when homoskedasticity fails to hold. So $\\widehat{\\mb{V}}_{\\bfbeta}$ has the advantage of being consistent regardless of this assumption. This advantage comes at a cost, however. When homoskedasticity is correct, $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$ incorporates that assumption into the estimator, whereas the HC variance estimator has to estimate it. The HC estimator will have higher variance (the variance estimator will be more variable!) when homoskedasticity actually does hold. \n\n\n\n\n\nNow that we have established the asymptotic normality of the OLS estimator and developed a consistent estimator of its variance, we can proceed with all of the statistical inference tools we discussed in Part I of this guide. 
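Before turning to those inference tools, here is a minimal sketch of the two variance estimators computed "by hand" from a fitted `lm()` object (the simulated data-generating process and object names are illustrative assumptions, not part of the main text). Both are on the scale of the coefficients themselves, so the square roots of their diagonals are standard errors.

```r
set.seed(2138)
n <- 1000
x <- rexp(n)
y <- 1 + 0.5 * x + rnorm(n, sd = x)  # errors are heteroskedastic in x

fit <- lm(y ~ x)
X <- model.matrix(fit)
ehat <- resid(fit)
k <- ncol(X) - 1

## Homoskedastic estimator: sigma2_hat * (X'X)^{-1}, what summary(fit) reports
sigma2_hat <- sum(ehat^2) / (n - k - 1)
V_lm <- sigma2_hat * solve(crossprod(X))

## Heteroskedasticity-consistent estimator:
## (X'X)^{-1} (sum_i ehat_i^2 x_i x_i') (X'X)^{-1}
bread <- solve(crossprod(X))
meat <- crossprod(X * ehat)  # sum_i ehat_i^2 x_i x_i'
V_hc <- bread %*% meat %*% bread

cbind(lm = sqrt(diag(V_lm)), robust = sqrt(diag(V_hc)))  # standard errors
```

With heteroskedastic errors like these, the robust column will typically differ noticeably from the `lm()` column; under homoskedasticity the two should be close, with the robust version a bit noisier, as discussed above. Packages such as `{sandwich}` (for example, `vcovHC(fit, type = "HC0")`) implement the same calculation along with small-sample refinements.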
Define the estimated **heteroskedasticity-consistent standard errors** as\n$$ \n\\widehat{\\se}(\\widehat{\\beta}_{j}) = \\sqrt{\\frac{[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}}{n}},\n$$\nwhere $[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}$ is the $j$th diagonal entry of the HC variance estimator. Note that we divide by $\\sqrt{n}$ here because $\\widehat{\\mb{V}}_{\\bfbeta}$ is a consistent estimator of the variance of the stabilized estimator $\\sqrt{n}(\\bhat - \\bfbeta)$, not of the variance of $\\bhat$ itself. \n\nHypothesis tests and confidence intervals for individual coefficients are almost precisely the same as in the general case presented in Part I. For a two-sided test of $H_0: \\beta_j = b$ versus $H_1: \\beta_j \\neq b$, we can build the t-statistic and conclude that, under the null,\n$$\n\\frac{\\widehat{\\beta}_j - b}{\\widehat{\\se}(\\widehat{\\beta}_{j})} \\indist \\N(0, 1).\n$$\nTypically, statistical software will helpfully provide the t-statistic for the null of no (partial) linear relationship between $X_{ij}$ and $Y_i$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nwhich measures how large the estimated coefficient is in standard errors. With $\\alpha = 0.05$, asymptotic normality would imply that we reject this null when $|t| > 1.96$. We can form asymptotically-valid confidence intervals with \n$$ \n\\left[\\widehat{\\beta}_{j} - z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j}),\\;\\widehat{\\beta}_{j} + z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j})\\right]. \n$$\nFor reasons we will discuss below, standard software typically relies on the $t$ distribution instead of the normal for hypothesis testing and confidence intervals. Still, this difference is of little consequence in large samples. \n\n## Inference for multiple parameters\n\nWith multiple coefficients, we might have hypotheses that involve more than one coefficient. As an example, let's focus on a regression with an interaction between two covariates, \n$$\nY_i = \\beta_0 + X_i\\beta_1 + Z_i\\beta_2 + X_iZ_i\\beta_3 + e_i.\n$$\nSuppose we wanted to test the hypothesis that $X_i$ does not affect the best linear predictor for $Y_i$. That would be\n$$ \nH_{0}: \\beta_{1} = 0 \\text{ and } \\beta_{3} = 0\\quad\\text{vs}\\quad H_{1}: \\beta_{1} \\neq 0 \\text{ or } \\beta_{3} \\neq 0,\n$$\nwhere we usually write the null more compactly as $H_0: \\beta_1 = \\beta_3 = 0$. \n\nTo test this null hypothesis, we need a test statistic that discriminates the two hypotheses: it should be large when the alternative is true and small when the null is true. With a single coefficient, we usually test the null hypothesis of $H_0: \\beta_j = b_0$ with the $t$-statistic, \n$$ \nt = \\frac{\\widehat{\\beta}_{j} - b_{0}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nand we usually take the absolute value, $|t|$, as our measure of how far our estimate is from the null. But notice that we could also use the square of the $t$ statistic, which is\n$$ \nt^{2} = \\frac{\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{\\V[\\widehat{\\beta}_{j}]} = \\frac{n\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{[\\mb{V}_{\\bfbeta}]_{jj}} \n$$ {#eq-squared-t}\n\nSo here's another way to differentiate the null from the alternative: the squared distance between them divided by the variance of the estimate. \n\nCan we generalize this idea to hypotheses about multiple parameters? Summing the squared distances for each component of the null hypothesis is straightforward. 
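To fix ideas before building the joint test, the sketch below (the simulated data and object names are illustrative assumptions, not part of the main text) fits the interaction specification and pulls out the two estimates and standard errors involved in this null hypothesis. For brevity it uses the default `lm()` variance estimate; the robust matrix from the earlier sketch could be substituted.

```r
set.seed(12345)
n <- 1000
x <- rnorm(n)
z <- rbinom(n, size = 1, prob = 0.5)
y <- 0.2 + 0.5 * z + rnorm(n)  # X truly does not enter the best linear predictor here

int_fit <- lm(y ~ x * z)       # coefficients: (Intercept), x, z, x:z
est <- coef(int_fit)[c("x", "x:z")]
se <- sqrt(diag(vcov(int_fit)))[c("x", "x:z")]

est / se  # one t-statistic per restriction; the joint test must combine them
```

Each statistic looks at one restriction at a time; the question is how to combine them into a single number.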
For our interaction example, that would be\n$$ \n\\widehat{\\beta}_1^2 + \\widehat{\\beta}_3^2, \n$$\nbut remember that some of the estimated coefficients are noisier than others, so we should account for the uncertainty, just like we did for the $t$-statistic. \n\nWith multiple coefficients, accounting for the variances (and covariances) will now require matrix algebra. We can write any hypothesis about linear functions of the coefficients as $H_{0}: \\mb{L}\\bfbeta = \\mb{c}$. For example, in the interaction case, we have\n$$ \n\\mb{L} =\n\\begin{pmatrix}\n 0 & 1 & 0 & 0 \\\\\n 0 & 0 & 0 & 1 \\\\\n\\end{pmatrix}\n\\qquad\n\\mb{c} =\n\\begin{pmatrix}\n 0 \\\\\n 0\n\\end{pmatrix}\n$$\nThus, $\\mb{L}\\bfbeta = \\mb{0}$ is equivalent to $\\beta_1 = 0$ and $\\beta_3 = 0$. Notice that with other $\\mb{L}$ matrices, we could represent more complicated hypotheses like $2\\beta_1 - \\beta_2 = 34$, though we mostly stick to simpler functions. Let $\\widehat{\\bs{\\theta}} = \\mb{L}\\bhat$ be the OLS estimate of the function of the coefficients. By the delta method (discussed in @sec-delta-method), we have\n$$ \n\\sqrt{n}\\left(\\mb{L}\\bhat - \\mb{L}\\bfbeta\\right) \\indist \\N(0, \\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}').\n$$\nWe can now generalize the squared $t$ statistic in @eq-squared-t. In particular, we will take the distances $\\mb{L}\\bhat - \\mb{c}$ weighted by the variance-covariance matrix $\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}'$, \n$$ \nW = n(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}')^{-1}(\\mb{L}\\bhat - \\mb{c}),\n$$\nwhich is called the **Wald test statistic**. This statistic generalizes the idea of the t-statistic to multiple parameters. With the t-statistic, we recenter to have mean 0 and divide by the standard error to get a variance of 1. If we ignore the middle variance weighting, we have $(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\bhat - \\mb{c})$ which is just the sum of the squared deviations of the estimates from the null. Including the $(\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}')^{-1}$ weight has the effect of rescaling the distribution of $\\mb{L}\\bhat - \\mb{c}$ to make it rotationally symmetric around 0 (so the resulting dimensions are uncorrelated) with each dimension having an equal variance of 1. In this way, the Wald statistic transforms the random vectors to be mean-centered and have variance 1 (just like the t-statistic), but also to have the resulting random variables in the vector be uncorrelated.[^norms]\n\n\n[^norms]: The form of the Wald statistic is that of a weighted inner product, $\\mb{x}'\\mb{Ay}$, where $\\mb{A}$ is a symmetric positive-definite weighting matrix. \n\nWhy transform the data in this way? @fig-wald shows the contour plot of a hypothetical joint distribution of two coefficients from an OLS regression. We might want to know how far different points in the distribution are from the mean, which in this case is $(1, 2)$. Without considering the joint distribution, the circle is obviously closer to the mean than the triangle. However, looking at where the two points are on the distribution, the circle is at a lower contour than the triangle, meaning it is more extreme than the triangle for this particular distribution. The Wald statistic, then, takes into consideration how much of a \"climb\" it is for $\\mb{L}\\bhat$ to get to $\\mb{c}$ given the distribution of $\\mb{L}\\bhat$.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of two slope coefficients. 
The circle is closer to the center of the distribution by the standard Euclidean distance, but the triangle is closer once you consider the joint distribution.](08_ols_properties_files/figure-html/fig-wald-1.png){#fig-wald width=672}\n:::\n:::\n\n\n\nIf $\\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$ statistic, $W = t^2$. This fact will help us think about the asymptotic distribution of $W$. Notice that as $n\\to\\infty$, we know that by the asymptotic normality of $\\bhat$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{\\widehat{\\se}[\\widehat{\\beta}_{j}]} \\indist \\N(0,1)\n$$\nso $t^2$ will converge in distribution to a $\\chi^2_1$ (since a $\\chi^2_1$ is just one standard normal squared). After recentering and rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\\mb{L}\\bfbeta = \\mb{c}$, we have $W \\indist \\chi^2_{q}$. \n\n\n::: {.callout-note}\n\n## Chi-squared critical values\n\nWe can obtain critical values for the $\\chi^2_q$ distribution using the `qchisq()` function in R. For example, if we wanted to obtain the critical value $w_{\\alpha}$ such that $\\P(W > w_{\\alpha}) = \\alpha$ for our two-parameter interaction example, we could use:\n\n::: {.cell}\n\n```{.r .cell-code}\nqchisq(p = 0.95, df = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5.991465\n```\n:::\n:::\n\n:::\n\n\nWe need to define the rejection region to use the Wald statistic in a test. Because we square each distance in $W$, we have $W \\geq 0$, and larger values of $W$ indicate more disagreement with the null in either direction. Thus, for an $\\alpha$-level test of the joint null, we only need a one-sided rejection region of the form $\\P(W > w_{\\alpha}) = \\alpha$. Obtaining these values is straightforward (see the callout above). For $q = 2$ and $\\alpha = 0.05$, the critical value is roughly 6. \n\n\nThe Wald statistic is not a common test provided by standard statistical software functions like `lm()` in R, though it is fairly straightforward to implement \"by hand.\" Alternatively, packages like [`{aod}`](https://cran.r-project.org/web/packages/aod/index.html) or [`{clubSandwich}`](http://jepusto.github.io/clubSandwich/) have implementations of the test. What is reported by most software implementations of OLS (like `lm()` in R) is the F-statistic, which is\n$$ \nF = \\frac{W}{q},\n$$\nwhich also typically uses the homoskedastic variance estimator $\\mb{V}^{\\texttt{lm}}_{\\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution is not really statistically justified, but it is slightly more conservative than the $\\chi^2_q$ distribution, and the two versions of the test will converge as $n\\to\\infty$. So it might be justified as an *ad hoc* small-sample adjustment to the Wald test. For example, suppose we use the $F_{q,n-k-1}$ distribution with the interaction example, where $q=2$, and a sample size of $n = 100$. 
In that case, the critical value for the F test with $\\alpha = 0.05$ is\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqf(0.95, df1 = 2, df2 = 100 - 4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.091191\n```\n:::\n:::\n\n\nThis result implies a critical value of 6.182 on the scale of the Wald statistic (multiplying it by $q = 2$). Compared to the earlier critical value of 5.991 based on the $\\chi^2_2$ distribution, we can see that the inferences will be very similar even in moderately-sized datasets. \n\nFinally, note that the F-statistic reported by `lm()` in R is the test of all the coefficients except the intercept being 0. In modern quantitative social sciences, this test is seldom substantively interesting. \n\n\n## Finite-sample properties with a linear CEF\n\nAll the above results have been large-sample properties, and we have not addressed finite-sample properties like the sampling variance or unbiasedness. Under the linear projection assumption above, OLS is generally biased without stronger assumptions. This section introduces the stronger assumption that will allow us to establish stronger properties for OLS. As usual, however, remember that these stronger assumptions can be wrong. \n\n\n::: {.callout-note}\n## Assumption: Linear Regression Model\n1. The variables $(Y_{i}, \\X_{i})$ satisfy the linear CEF assumption.\n$$ \n\\begin{aligned}\n Y_{i} &= \\X_{i}'\\bfbeta + e_{i} \\\\\n \\E[e_{i}\\mid \\X_{i}] & = 0.\n\\end{aligned}\n$$\n\n2. The design matrix is invertible $\\E[\\X_{i}\\X_{i}'] > 0$ (positive definite).\n:::\n\n\nWe discussed the concept of a linear CEF extensively in @sec-regression. However, recall that the CEF might be linear mechanically if the model is **saturated** or when there are as many coefficients in the model as there are unique values of $\\X_i$. When a model is not saturated, the linear CEF assumption is just that: an assumption. What can this assumption do? It can actually establish quite a few nice statistical properties in finite samples. \n\nOne note before we proceed. When focusing on the finite sample inference for OLS, it is customary to focus on its properties **conditional on the observed covariates**, such as $\\E[\\bhat \\mid \\Xmat]$ or $\\V[\\bhat \\mid \\Xmat]$. The historical reason for this was that the researcher often chose these independent variables, so they were not random. Thus, you'll sometimes see $\\Xmat$ treated as \"fixed\" in some older texts, and they might even omit explicit conditioning statements. \n\n\n::: {#thm-ols-unbiased}\n\nUnder the linear regression model assumption, OLS is unbiased for the population regression coefficients, \n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta,\n$$\nand its conditional sampling variance is\n$$\n\\mb{\\V}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nwhere $\\sigma^2_{i} = \\E[e_{i}^{2} \\mid \\Xmat]$. \n:::\n\n\n::: {.proof}\n\nTo prove the conditional unbiasedness, recall that we can write the OLS estimator as\n$$\n\\bhat = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e},\n$$\nand so taking (conditional) expectations, we have,\n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta + \\E[(\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\E[\\mb{e} \\mid \\Xmat] = \\bfbeta,\n$$\nbecause under the linear CEF assumption $\\E[\\mb{e}\\mid \\Xmat] = 0$. 
\n\nFor the conditional sampling variance, we can use the same decomposition as before,\n$$\n\\V[\\bhat \\mid \\Xmat] = \\V[\\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = (\\Xmat'\\Xmat)^{-1}\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat(\\Xmat'\\Xmat)^{-1}. \n$$\nSince $\\E[\\mb{e}\\mid \\Xmat] = 0$, we know that $\\V[\\mb{e}\\mid \\Xmat] = \\E[\\mb{ee}' \\mid \\Xmat]$, which is a matrix with diagonal entries $\\E[e_{i}^{2} \\mid \\Xmat] = \\sigma^2_i$ and off-diagonal entries $\\E[e_{i}e_{j} \\mid \\Xmat] = \\E[e_{i}\\mid \\Xmat]\\E[e_{j}\\mid\\Xmat] = 0$, where the first equality follows from the independence of the errors across units. Thus, $\\V[\\mb{e} \\mid \\Xmat]$ is a diagonal matrix with $\\sigma^2_i$ along the diagonal, which means\n$$\n\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat = \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i',\n$$\nestablishing the conditional sampling variance.\n \n:::\n\nThus, for any realization of the covariates, $\\Xmat$, OLS is unbiased for the true regression coefficients $\\bfbeta$. By the law of iterated expectation, we also know that it is unconditionally unbiased[^unconditional] as well since\n$$\n\\E[\\bhat] = \\E[\\E[\\bhat \\mid \\Xmat]] = \\bfbeta. \n$$\nThe difference between these two statements usually isn't incredibly meaningful. \n\n[^unconditional]: We are basically ignoring some edge cases when it comes to discrete covariates here. In particular, we assume that $\\Xmat'\\Xmat$ is nonsingular with probability one. However, this can fail if we have a binary covariate since there is some chance (however slight) that the entire column will be all ones or all zeros, which would lead to a singular matrix $\\Xmat'\\Xmat$. Practically this is not a big deal, but it does mean that we have to ignore this issue theoretically or focus on conditional unbiasedness. \n\n\nThere are a lot of variances flying around, so it's helpful to review them. Above, we derived the asymptotic variance of $\\mb{Z}_{n} = \\sqrt{n}(\\bhat - \\bfbeta)$, \n$$\n\\mb{V}_{\\bfbeta} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1},\n$$\nwhich implies that the approximate variance of $\\bhat$ will be $\\mb{V}_{\\bfbeta} / n$ because\n$$\n\\bhat = \\frac{Z_n}{\\sqrt{n}} + \\bfbeta \\quad\\implies\\quad \\bhat \\overset{a}{\\sim} \\N(\\bfbeta, n^{-1}\\mb{V}_{\\bfbeta}),\n$$\nwhere $\\overset{a}{\\sim}$ means approximately asymptotically distributed as. Under the linear CEF, the conditional sampling variance of $\\bhat$ has a similar form and will be approximately equal to the asymptotic variance divided by $n$,\n$$\n\\mb{V}_{\\bhat} = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\approx \\mb{V}_{\\bfbeta} / n.\n$$\nIn practice, these two derivations lead to basically the same variance estimator. Recall that the heteroskedasticity-consistent variance estimator\n$$\n\\widehat{\\mb{V}}_{\\bfbeta} = \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n$$\nis a valid plug-in estimator for the asymptotic variance and\n$$\n\\widehat{\\mb{V}}_{\\bhat} = n^{-1}\\widehat{\\mb{V}}_{\\bfbeta}.\n$$\nThus, in practice, the asymptotic and finite-sample results under a linear CEF justify the same variance estimator. \n\n\n### Linear CEF model under homoskedasticity\n\nIf we are willing to make a homoskedasticity assumption on the errors, we can derive even stronger results for OLS. 
Stronger assumptions typically lead to stronger conclusions, but those conclusions may not be robust to assumption violations. But homoskedasticity is such a historically important assumption that statistical software implementations of OLS like `lm()` in R assume it. \n\n::: {.callout-note}\n\n## Assumption: Homoskedasticity with a linear CEF\n\nIn addition to the linear CEF assumption, we further assume that\n$$\n\\E[e_i^2 \\mid \\X_i] = \\E[e_i^2] = \\sigma^2,\n$$\nor that variance of the errors does not depend on the covariates. \n:::\n\n\n::: {#thm-homoskedasticity}\n\nUnder a linear CEF model with homoskedastic errors, the conditional sampling variance is\n$$\n\\mb{V}^{\\texttt{lm}}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\sigma^2 \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nand the variance estimator \n$$\n\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} = \\widehat{\\sigma}^2 \\left( \\Xmat'\\Xmat \\right)^{-1} \\quad\\text{where,}\\quad \\widehat{\\sigma}^2 = \\frac{1}{n - k - 1} \\sum_{i=1}^n \\widehat{e}_i^2\n$$\nis unbiased, $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. \n::: \n\n::: {.proof}\nUnder homoskedasticity $\\sigma^2_i = \\sigma^2$ for all $i$. Recall that $\\sum_{i=1}^n \\X_i\\X_i' = \\Xmat'\\Xmat$. Thus, the conditional sampling variance from @thm-ols-unbiased, \n$$ \n\\begin{aligned}\n\\V[\\bhat \\mid \\Xmat] &= \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2 \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\ &= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\Xmat'\\Xmat \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1} = \\mb{V}^{\\texttt{lm}}_{\\bhat}.\n\\end{aligned}\n$$\n\nFor unbiasedness, we just need to show that $\\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] = \\sigma^2$. Recall that we defined $\\mb{M}_{\\Xmat}$ as the residual-maker because $\\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}}$. We can use this to connect the residuals to the errors,\n$$ \n\\mb{M}_{\\Xmat}\\mb{e} = \\mb{M}_{\\Xmat}\\mb{Y} - \\mb{M}_{\\Xmat}\\Xmat\\bfbeta = \\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}},\n$$ \nso \n$$\n\\V[\\widehat{\\mb{e}} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\V[\\mb{e} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\sigma^2,\n$$\nwhere the first equality is because $\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'$ is constant conditional on $\\Xmat$. Notice that the diagonal entries of this matrix are the variances of particular residuals $\\widehat{e}_i$ and that the diagonal entries of the annihilator matrix are $1 - h_{ii}$ (since the $h_{ii}$ are the diagonal entries of $\\mb{P}_{\\Xmat}$). Thus, we have\n$$ \n\\V[\\widehat{e}_i \\mid \\Xmat] = \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] = (1 - h_{ii})\\sigma^{2}.\n$$\nIn the last chapter, we established one property of these leverage values in @sec-leverage, namely $\\sum_{i=1}^n h_{ii} = k+ 1$, so $\\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have\n$$ \n\\begin{aligned}\n \\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] &= \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] \\\\\n &= \\frac{\\sigma^{2}}{n-k-1} \\sum_{i=1}^{n} 1 - h_{ii} \\\\\n &= \\sigma^{2}. \n\\end{aligned}\n$$\nThis establishes $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. 
\n\n:::\n\n\nThus, under the linear CEF model and homoskedasticity of the errors, we have an unbiased variance estimator that is a simple function of the sum of squared residuals and the design matrix. Most statistical software packages estimate standard errors using $\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}$. \n\n\nThe final result we can derive for the linear CEF under homoskedasticity is an optimality result. We might ask ourselves if there is another estimator for $\\bfbeta$ that would outperform OLS in the sense of having a lower sampling variance. Perhaps surprisingly, no linear estimator for $\\bfbeta$ has a lower conditional variance, meaning that OLS is the **best linear unbiased estimator**, often jovially shortened to BLUE. This result is famously known as the Gauss-Markov Theorem.\n\n::: {#thm-gauss-markov}\n\nLet $\\widetilde{\\bfbeta} = \\mb{AY}$ be a linear and unbiased estimator for $\\bfbeta$. Under the linear CEF model with homoskedastic errors, \n$$\n\\V[\\widetilde{\\bfbeta}\\mid \\Xmat] \\geq \\V[\\bhat \\mid \\Xmat]. \n$$\n\n:::\n\n::: {.proof}\nNote that if $\\widetilde{\\bfbeta}$ is unbiased then $\\E[\\widetilde{\\bfbeta} \\mid \\Xmat] = \\bfbeta$ and so \n$$\n\\bfbeta = \\E[\\mb{AY} \\mid \\Xmat] = \\mb{A}\\E[\\mb{Y} \\mid \\Xmat] = \\mb{A}\\Xmat\\bfbeta,\n$$\nwhich implies that $\\mb{A}\\Xmat = \\mb{I}_n$. \nRewrite the competitor as $\\widetilde{\\bfbeta} = \\bhat + \\mb{BY}$ where,\n$$ \n\\mb{B} = \\mb{A} - \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'.\n$$\nand note that $\\mb{A}\\Xmat = \\mb{I}_n$ implies that $\\mb{B}\\Xmat = 0$. We now have\n$$ \n\\begin{aligned}\n \\widetilde{\\bfbeta} &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{Y} \\\\\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\mb{B}\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e}\n\\end{aligned}\n$$\nThe variance of the competitor is, thus, \n$$ \n\\begin{aligned}\n \\V[\\widetilde{\\bfbeta} \\mid \\Xmat]\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\V[\\mb{e}\\mid \\Xmat]\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)' \\\\\n &= \\sigma^{2}\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\left( \\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{B}'\\right) \\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\mb{B}' + \\mb{B}\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &\\geq \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1} \\\\\n &= \\V[\\bhat \\mid \\Xmat]\n\\end{aligned}\n$$\nThe first equality comes from the properties of covariance matrices; the second is due to homoskedasticity; the fourth is due to $\\mb{B}\\Xmat = 0$, which implies that $\\Xmat'\\mb{B}' = 0$ as well. The fifth inequality holds because matrix products of the form $\\mb{BB}'$ are positive definite if $\\mb{B}$ is of full rank (which we have assumed it is). 
\n\n:::\n\nIn this proof, we saw that the variance of the competing estimator had variance $\\sigma^2\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)$ which we argued was \"greater than 0\" in the matrix sense, which is also called positive definite. What does this mean practically? Remember that any positive definite matrix must have strictly positive diagonal entries and that the diagonal entries of $\\V[\\bhat \\mid \\Xmat]$ and $V[\\widetilde{\\bfbeta}\\mid \\Xmat]$ are the variances of the individual parameters, $\\V[\\widehat{\\beta}_{j} \\mid \\Xmat]$ and $\\V[\\widetilde{\\beta}_{j} \\mid \\Xmat]$. Thus, the variances of the individual parameters will be larger for $\\widetilde{\\bfbeta}$ than for $\\bhat$.\n\nMany textbooks cite the Gauss-Markov theorem as a critical advantage of OLS over other methods, but it's essential to recognize its limitations. It requires linearity and homoskedastic error assumptions, which can be false in many applications. \n\nFinally, note that while we have shown this result for linear estimators, @Hansen22 proves a more general version of this result that applies to any unbiased estimator. \n\n## The normal linear model\n\nFinally, we add the strongest and thus least loved of the classical linear regression assumption: (conditional) normality of the errors. The historical reason to use this assumption was that finite-sample inference hits a roadblock without some knowledge of the sampling distribution of $\\bhat$. Under the linear CEF model, we saw that it was unbiased, and under homoskedasticity, we could produce an unbiased estimator of the conditional variance. But to do hypothesis testing or generate confidence intervals, we need to be able to make probability statements about the estimator, and for that, we need to know its exact distribution. When the sample size is large, we can rely on the CLT and know it is approximately normal. But in small samples, what do we do? Historically, we decided to assume (conditional) normality of the errors to proceed with some knowledge that we were wrong but hopefully not too wrong. \n\n\n::: {.callout-note}\n\n## The normal linear regression model\n\nIn addition to the linear CEF assumption, we assume that \n$$\ne_i \\mid \\Xmat \\sim \\N(0, \\sigma^2).\n$$\n\n:::\n\nA couple of things to point out: \n\n- The assumption here is not that $(Y_{i}, \\X_{i})$ are jointly normal (though this would be sufficient for the assumption to hold), but rather that $Y_i$ is normally distributed conditional on $\\X_i$. \n- Notice that the normal regression model has the homoskedasticity assumption baked in. \n\n::: {#thm-normal-ols}\n\nUnder the normal linear regression model, we have\n$$ \n\\begin{aligned}\n \\bhat \\mid \\Xmat &\\sim \\N\\left(\\bfbeta, \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1}\\right) \\\\\n \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}]_{jj}/\\sqrt{n}} &\\sim t_{n-k-1} \\\\\n W/q &\\sim F_{q, n-k-1}. \n\\end{aligned}\n$$\n\n:::\n\n\nThis theorem says that in the normal linear regression model, the coefficients follow a normal distribution, the t-statistics follow a $t$-distribution, and a transformation of the Wald statistic follows an $F$ distribution. These are **exact** results and do not rely on large-sample approximations. Under the assumption of conditional normality of the errors, they are as valid for $n = 5$ as for $n = 500,000$. \n\nFew people believe errors follow a normal distribution, so why even present these results? 
Unfortunately, most statistical software implementations of OLS implicitly assume this when calculating p-values for tests or constructing confidence intervals. That is, the p-value associated with the $t$-statistic that `lm()` outputs in R relies on the $t_{n-k-1}$ distribution, and the critical values used to construct confidence intervals with `confint()` use that distribution as well. When normality does not hold, there is no principled reason to use the $t$ or the $F$ distributions in this way. But we might hold our nose and use this *ad hoc* procedure under two rationalizations:\n\n- $\\bhat$ is asymptotically normal, but this approximation might be poor in smaller finite samples. The $t$ distribution will make inference more conservative in these cases (wider confidence intervals, smaller test rejection regions), which might help offset the poor approximation of the normal in small samples. \n- As $n\\to\\infty$, the $t_{n-k-1}$ will converge to a standard normal distribution, so the *ad hoc* adjustment will not matter much for medium to large samples. \n\nThese arguments are not very convincing since it's unclear whether the $t$ approximation will be any better than the normal in finite samples. But it's the best we can do to console ourselves as we find more data. \n", "supporting": [ "08_ols_properties_files/figure-html" ], diff --git a/_freeze/08_ols_properties/execute-results/tex.json b/_freeze/08_ols_properties/execute-results/tex.json index 586e02e..6ce6590 100644 --- a/_freeze/08_ols_properties/execute-results/tex.json +++ b/_freeze/08_ols_properties/execute-results/tex.json @@ -1,7 +1,7 @@ { - "hash": "2c4a143aed2c71fb6dce898bef9e774a", + "hash": "92c14d8fe0bbdcbbdcc4b09f07ed1ada", "result": { - "markdown": "# The statistics of least squares\n\nIn the last chapter, we derived the least squares estimator and investigated many of its mechanical properties. These properties are essential for the practical application of OLS. Still, we should also understand its statistical properties, such as the ones described in Part I: unbiasedness, sampling variance, consistency, and asymptotic normality. As we saw then, these properties fall into finite-sample (unbiasedness, sampling variance) and asymptotic (consistency, asymptotic normality). \n\nIn this chapter, we will focus first on the asymptotic properties of OLS because those properties hold under the relatively mild conditions of the linear projection model introduced in @sec-linear-projection. We will see that OLS consistently estimates a coherent quantity of interest (the best linear predictor) regardless of whether the conditional expectation is linear. That is, for the asymptotic properties of the estimator, we will not need the commonly invoked linearity assumption. Later, when we investigate the finite-sample properties, we will show how linearity will help us establish unbiasedness and how normality of the errors can allow us to conduct exact, finite-sample inference. But these assumptions are very strong, so it's vital to understand what we can say about OLS without making them. \n\n## Large-sample properties of OLS\n\nAs we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: a consistent estimate of the variance of $\\bhat$ and the approximate distribution of $\\bhat$ in large samples. Remember that since $\\bhat$ is a vector, then the variance of that estimator will actually be a variance-covariance matrix. 
To obtain these two ingredients, we will first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which will include its variance. \n\n\nWe begin by setting out the assumptions we will need for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\\bhat = \\E[\\X_{i}\\X_{i}']^{-1}\\E[\\X_{i}Y_{i}]$, is well-defined and unique. \n\n::: {.callout-note}\n\n### Linear projection assumptions\n\nThe linear projection model makes the following assumptions:\n\n1. $\\{(Y_{i}, \\X_{i})\\}_{i=1}^n$ are iid random vectors. \n\n2. $\\E[Y_{i}^{2}] < \\infty$ (finite outcome variance)\n\n3. $\\E[\\Vert \\X_{i}\\Vert^{2}] < \\infty$ (finite variances and covariances of covariates)\n\n2. $\\E[\\X_{i}\\X_{i}']$ is positive definite (no linear dependence in the covariates)\n:::\n\n\nRecall that these are mild conditions on the joint distribution of $(Y_{i}, \\X_{i})$ and in particular, we are **not** assuming linearity of the CEF, $\\E[Y_{i} \\mid \\X_{i}]$, nor are we assuming any specific distribution for the data. \n\nWe can helpfully decompose the OLS estimator into the actual BLP coefficient plus estimation error as\n$$ \n\\bhat = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_iY_i \\right) = \\bfbeta + \\underbrace{\\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\right)}_{\\text{estimation error}}.\n$$ \n \nThis decomposition will help us quickly establish the consistency of $\\bhat$. By the law of large numbers, we know that sample means will converge in probability to population expectations, so we have\n$$ \n\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\inprob \\E[\\X_i\\X_i'] \\equiv \\mb{Q}_{\\X\\X} \\qquad \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\inprob \\E[\\X_{i} e_{i}] = \\mb{0},\n$$\nwhich implies that \n$$\n\\bhat \\inprob \\beta + \\mb{Q}_{\\X\\X}^{-1}\\E[\\X_ie_i] = \\beta,\n$$\nby the continuous mapping theorem (the inverse is a continuous function). The linear projection assumptions ensure that LLN applies to these sample means and ensure that $\\E[\\X_{i}\\X_{i}']$ is invertible. \n\n\n::: {#thm-ols-consistency}\nUnder the above linear projection assumptions, the OLS estimator is consistent for the best linear projection coefficients, $\\bhat \\inprob \\bfbeta$.\n:::\n\nThus, OLS should be close to the population linear regression in large samples under relatively mild conditions. Remember that this might not equal the conditional expectation if the CEF is nonlinear. We can say here that OLS converges to the best *linear* approximation to the CEF. Of course, this also means that if the CEF is linear, then OLS will consistently estimate the coefficients of the CEF. \n\nTo emphasize here: the only assumption we made about the dependent variable is that it has finite variance and is iid. Under this assumption, the outcome could be continuous, categorical, binary, or event count. \n\n\nNext, we would like to establish an asymptotic normality result for the OLS coefficients. We first review some key ideas about the central limit theorem.\n\n::: {.callout-note}\n\n## CLT reminder\n\nSuppose that we have a function of the data iid random vectors $\\X_1, \\ldots, \\X_n$, $g(\\X_{i})$ where $\\E[g(\\X_{i})] = 0$ and so $\\V[g(\\X_{i})] = \\E[g(\\X_{i})g(\\X_{i})']$. 
Then if $\\E[\\Vert g(\\X_{i})\\Vert^{2}] < \\infty$, the CLT implies that\n$$ \n\\sqrt{n}\\left(\\frac{1}{n} \\sum_{i=1}^{n} g(\\X_{i}) - \\E[g(\\X_{i})]\\right) = \\frac{1}{\\sqrt{n}} \\sum_{i=1}^{n} g(\\X_{i}) \\indist \\N(0, \\E[g(\\X_{i})g(\\X_{i}')]) \n$$ {#eq-clt-mean-zero}\n:::\n\nWe now manipulate our decomposition to arrive at the *stabilized* version of the estimator,\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\right).\n$$\nWe have already established that the first term on the right-hand side will converge in probability to $\\mb{Q}_{\\X\\X}^{-1}$. Notice that $\\E[\\X_{i}e_{i}] = 0$, so we can apply @eq-clt-mean-zero to the second term. The covariance matrix of $\\X_ie_{i}$ is \n$$ \n\\mb{\\Omega} = \\V[\\X_{i}e_{i}] = \\E[\\X_{i}e_{i}(\\X_{i}e_{i})'] = \\E[e_{i}^{2}\\X_{i}\\X_{i}'].\n$$ \nThe CLT will imply that\n$$ \n\\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\indist \\N(0, \\mb{\\Omega}).\n$$\nCombining these facts with Slutsky's Theorem implies the following theorem. \n\n::: {#thm-ols-asymptotic-normality}\n\nSuppose that the linear projection assumptions hold and, in addition, we have $\\E[Y_{i}^{4}] < \\infty$ and $\\E[\\lVert\\X_{i}\\rVert^{4}] < \\infty$. Then the OLS estimator is asymptotically normal with\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) \\indist \\N(0, \\mb{V}_{\\bfbeta}),\n$$\nwhere\n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1}.\n$$\n\n:::\n\nThus, if the sample size is large enough, we can approximate the distribution of $\\bhat$ with a multivariate normal with mean $\\bfbeta$ and covariance matrix $\\mb{V}_{\\bfbeta}/n$. In particular, the square root of the $j$th diagonals of this matrix will be standard errors for $\\widehat{\\beta}_j$. Knowing the shape of the OLS estimator's multivariate distribution will allow us to conduct hypothesis tests and generate confidence intervals for both individual coefficients and groups of coefficients. But first, we need an estimate of the covariance matrix!\n\n\n\n## Variance estimation for OLS\n\nThe asymptotic normality of OLS from the last section is of limited value without some way to estimate the covariance matrix, \n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1}.\n$$\nSince each term here is a population mean, this is an ideal place to drop a plug-in estimator. In particular, let's use the following estimators:\n$$ \n\\begin{aligned}\n \\mb{Q}_{\\X\\X} &= \\E[\\X_{i}\\X_{i}'] & \\widehat{\\mb{Q}}_{\\X\\X} &= \\frac{1}{n} \\sum_{i=1}^{n} \\X_{i}\\X_{i}' = \\frac{1}{n}\\Xmat'\\Xmat \\\\\n \\mb{\\Omega} &= \\E[e_i^2\\X_i\\X_i'] & \\widehat{\\mb{\\Omega}} & = \\frac{1}{n}\\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i'.\n\\end{aligned}\n$$\nUnder the assumptions of @thm-ols-asymptotic-normality, the LLN will imply that these are consistent for their targets, $\\widehat{\\mb{Q}}_{\\X\\X} \\inprob \\mb{Q}_{\\X\\X}$ and $\\widehat{\\mb{\\Omega}} \\inprob \\mb{\\Omega}$. 
We can plug these into the variance formula to arrive at\n$$ \n\\begin{aligned}\n \\widehat{\\mb{V}}_{\\bfbeta} &= \\widehat{\\mb{Q}}_{\\X\\X}^{-1}\\widehat{\\mb{\\Omega}}\\widehat{\\mb{Q}}_{\\X\\X}^{-1} \\\\\n &= \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n\\end{aligned}\n$$\nwhich by the continuous mapping theorem is consistent, $\\widehat{\\mb{V}}_{\\bfbeta} \\inprob \\mb{V}_{\\bfbeta}$. \n\nThis estimator is sometimes called the **robust variance estimator** or, more accurately, the **heteroskedasticity-consistent (HC) variance estimator**. How is this robust? Consider the standard **homoskedasticity** assumption that most statistical software packages make when estimating OLS variances: the variance of the errors does not depend on the covariates: $\\V[e_{i}^{2} \\mid \\X_{i}] = \\V[e_{i}^{2}]$. This assumption is stronger than we need, and we can rely on a weaker assumption that the squared errors are uncorrelated with a specific function of the covariates: \n$$ \n\\E[e_{i}^{2}\\X_{i}\\X_{i}'] = \\E[e_{i}^{2}]\\E[\\X_{i}\\X_{i}'] = \\sigma^{2}\\mb{Q}_{\\X\\X}, \n$$\nwhere $\\sigma^2$ is the variance of the residuals (since $\\E[e_{i}] = 0$). Homoskedasticity simplifies the asymptotic variance of the stabilized estimator, $\\sqrt{n}(\\bhat - \\bfbeta)$, to\n$$ \n\\mb{V}^{\\texttt{lm}}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\sigma^{2}\\mb{Q}_{\\X\\X}\\mb{Q}_{\\X\\X}^{-1} = \\sigma^2\\mb{Q}_{\\X\\X}^{-1}.\n$$\nWe already have an estimator for $\\mb{Q}_{\\X\\X}$, but we need one for $\\sigma^2$. We can easily use the SSR,\n$$ \n\\widehat{\\sigma}^{2} = \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\widehat{e}_{i}^{2},\n$$\nwhere we use $n-k-1$ in the denominator instead of $n$ to correct for the residuals being slightly less variable than the actual errors (because OLS mechanically attempts to make the residuals small). For consistent variance estimation, $n-k -1$ or $n$ can be used, since either way $\\widehat{\\sigma}^2 \\inprob \\sigma^2$. Thus, under homoskedasticity, we have\n$$ \n\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}} = \\widehat{\\sigma}^{2}\\left(\\Xmat'\\Xmat\\right)^{{-1}},\n$$\nwhich is the standard variance estimator used by `lm()` in R or `reg` in Stata. \n\n\nNow that we have two estimators, $\\widehat{\\mb{V}}_{\\bfbeta}$ and $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$, how do they compare? Notice that the HC variance estimator and the homoskedasticity variance estimator will both be consistent when homoskedasticity holds. But as the \"heteroskedasticity-consistent\" label implies, only the HC variance estimator will be consistent when homoskedasticity fails to hold. So $\\widehat{\\mb{V}}_{\\bfbeta}$ has the advantage of being consistent regardless of this assumption. This advantage comes at a cost, however. When homoskedasticity is correct, $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$ incorporates that assumption into the estimator where the HC variance estimator has to estimate it. The HC estimator will have higher variance (the variance estimator will be more variable!) when homoskedasticity actually does hold. \n\n\n\n\n\nNow that we have established the asymptotic normality of the OLS estimator and developed a consistent estimator of its variance, we can proceed with all of the statistical inference tools we discussed in Part I of this guide. 
Define the estimated **heteroskedasticity-consistent standard errors** as\n$$ \n\\widehat{\\se}(\\widehat{\\beta}_{j}) = \\sqrt{\\frac{[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}}{n}},\n$$\nwhere $[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}$ is the $j$th diagonal entry of the HC variance estimator. Note that we divide by $\\sqrt{n}$ here because $\\widehat{\\mb{V}}_{\\bfbeta}$ is a consistent estimator of the stabilized estimator $\\sqrt{n}(\\bhat - \\bfbeta)$ not the estimator itself. \n\nHypothesis tests and confidence intervals for individual coefficients are almost precisely the same as with the general case presented in Part I. For a two-sided test of $H_0: \\beta_j = b$ versus $H_1: \\beta_j \\neq b$, we can build the t-statistic and conclude that, under the null,\n$$\n\\frac{\\widehat{\\beta}_j - b}{\\widehat{\\se}(\\widehat{\\beta}_{j})} \\indist \\N(0, 1).\n$$\nTypically, statistical software will helpfully provide the t-statistic for the null of no (partial) linear relationship between $X_{ij}$ and $Y_i$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nwhich measures how large the estimated coefficient is in standard errors. With $\\alpha = 0.05$, asymptotic normality would imply that we reject this null when $t > 1.96$. We can form asymptotically-valid confidence intervals with \n$$ \n\\left[\\widehat{\\beta}_{j} - z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j}),\\;\\widehat{\\beta}_{j} + z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j})\\right]. \n$$\nFor reasons we will discuss below, standard software typically relies on the $t$ distribution instead of the normal for hypothesis testing and confidence intervals. Still, this difference is of little consequence in large samples. \n\n## Inference for multiple parameters\n\nWith multiple coefficients, we might have hypotheses that involve more than one coefficient. As an example, let's focus on a regression with an interaction between two covariates, \n$$\nY_i = \\beta_0 + X_i\\beta_1 + Z_i\\beta_2 + X_iZ_i\\beta_3 + e_i.\n$$\nSuppose we wanted to test the hypothesis that $X_i$ does not affect the best linear predictor for $Y_i$. That would be\n$$ \nH_{0}: \\beta_{1} = 0 \\text{ and } \\beta_{3} = 0\\quad\\text{vs}\\quad H_{1}: \\beta_{1} \\neq 0 \\text{ or } \\beta_{3} \\neq 0,\n$$\nwhere we usually write the null more compactly as $H_0: \\beta_1 = \\beta_3 = 0$. \n\nTo test this null hypothesis, we need a test statistic that discriminates the two hypotheses: it should be large when the alternative is true and small when the null is true. With a single coefficient, we usually test the null hypothesis of $H_0: \\beta_j = b_0$ with the $t$-statistic, \n$$ \nt = \\frac{\\widehat{\\beta}_{j} - b_{0}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nand we usually take the absolute value, $|t|$, as our measure of how far our estimate is from the null. But notice that we could also use the square of the $t$ statistic, which is\n$$ \nt^{2} = \\frac{\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{\\V[\\widehat{\\beta}_{j}]} = \\frac{n\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{[\\mb{V}_{\\bfbeta}]_{[jj]}} \n$$ {#eq-squared-t}\n\nSo here's another way to differentiate the null from the alternative: the squared distance between them divided by the variance of the estimate. \n\nCan we generalize this idea to hypotheses about multiple parameters? Adding the sum of squared distances for each component of the null hypothesis is straightforward. 
For our interaction example, that would be\n$$ \n\\widehat{\\beta}_1^2 + \\widehat{\\beta}_3^2, \n$$\nbut remember that some of the estimated coefficients are noisier than others, so we should account for the uncertainty, just like we did for the $t$-statistic. \n\nWith multiple parameters and multiple coefficients, the variances will now require matrix algebra. We can write any hypothesis about linear functions of the coefficients as $H_{0}: \\mb{L}\\bfbeta = \\mb{c}$. For example, in the interaction case, we have\n$$ \n\\mb{L} =\n\\begin{pmatrix}\n 0 & 1 & 0 & 0 \\\\\n 0 & 0 & 0 & 1 \\\\\n\\end{pmatrix}\n\\qquad\n\\mb{c} =\n\\begin{pmatrix}\n 0 \\\\\n 0\n\\end{pmatrix}\n$$\nThus, $\\mb{L}\\bfbeta = \\mb{0}$ is equivalent to $\\beta_1 = 0$ and $\\beta_3 = 0$. Notice that with other $\\mb{L}$ matrices, we could represent more complicated hypotheses like $2\\beta_1 - \\beta_2 = 34$, though we mostly stick to simpler functions. Let $\\widehat{\\bs{\\theta}} = \\mb{L}\\bhat$ be the OLS estimate of the function of the coefficients. By the delta method (discussed in @sec-delta-method), we have\n$$ \n\\sqrt{n}\\left(\\mb{L}\\bhat - \\mb{L}\\bfbeta\\right) \\indist \\N(0, \\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L}).\n$$\nWe can now generalize the squared $t$ statistic in @eq-squared-t. In particular, we will take the distances $\\mb{L}\\bhat - \\mb{c}$ weighted by the variance-covariance matrix $\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L}$, \n$$ \nW = n(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L})^{-1}(\\mb{L}\\bhat - \\mb{c}),\n$$\nwhich is called the **Wald test statistic**. This statistic generalizes the ideas of the t-statistic to multiple parameters. With the t-statistic, we recenter to have mean 0 and divide by the standard error to get a variance of 1. If we ignore the middle variance weighting, we have $(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\bhat - \\mb{c})$ which is just the sum of the squared deviations of the estimates from the null. Including the $(\\mb{L}'\\mb{V}_{\\bfbeta}\\mb{L})^{-1}$ weight has the effect of rescaling the distribution of $\\mb{L}\\bhat - \\mb{c}$ to make it rotationally symmetric around 0 (so the resulting dimensions are uncorrelated) with each dimension having an equal variance of 1. In this way, the Wald statistic transforms the random vectors to be mean-centered and have variance 1 (just the t-statistic), but also to have the resulting random variables in the vector be uncorrelated.[^norms]\n\n\n[^norms]: The form of the Wald statistic is that of a weighted inner product, $\\mb{x}'\\mb{Ay}$, where $\\mb{A}$ is a symmetric positive-definite weighting matrix. \n\nWhy transform the data in this way? @fig-wald shows the contour plot of a hypothetical joint distribution of two coefficients from an OLS regression. We might want to know how far different points in the distribution are from the mean, which in this case is $(1, 2)$. Without considering the joint distribution, the circle is obviously closer to the mean than the triangle. However, looking at where the two points are on the distribution, the circle is at a lower contour than the triangle, meaning it is more extreme than the triangle for this particular distribution. The Wald statistic, then, takes into consideration how much of a \"climb\" it is for $\\mb{L}\\bhat$ to get to $\\mb{c}$ given the distribution of $\\mb{L}\\bhat$.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of two slope coefficients. 
The circle is closer to the center of the distribution by the standard Euclidean distance, but the triangle is closer once you consider the joint distribution.](08_ols_properties_files/figure-pdf/fig-wald-1.pdf){#fig-wald}\n:::\n:::\n\n\n\n\nIf $\\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$ statistic, $W = t^2$. This fact will help us think about the asymptotic distribution of $W$. Notice that as $n\\to\\infty$, we know that by the asymptotic normality of $\\bhat$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{\\widehat{\\se}[\\widehat{\\beta}_{j}]} \\indist \\N(0,1)\n$$\nso $t^2$ will converge in distribution to a $\\chi^2_1$ (since a $\\chi^2_1$ is just one standard normal squared). After recentering ad rescaling by the covariance matrix, $W$ converges to the sum of $q$ squared independent normals, where $q$ is the number of rows of $\\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\\mb{L}\\bhat = \\mb{c}$, we have $W \\indist \\chi^2_{q}$. \n\n\n::: {.callout-note}\n\n## Chi-squared critical values\n\nWe can obtain critical values for the $\\chi^2_q$ distribution using the `qchisq()` function in R. For example, if we wanted to obtain the critical value $w$ that such that $\\P(W > w_{\\alpha}) = \\alpha$ for our two-parameter interaction example, we could use:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqchisq(p = 0.95, df = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5.991465\n```\n:::\n:::\n\n\n:::\n\n\nWe need to define the rejection region to use the Wald statistic in a test. Because we are squaring each distance in $W \\geq 0$, larger values of $W$ indicate more disagreement with the null in either direction. Thus, for an $\\alpha$-level test of the joint null, we only need a one-sided rejection region of the form $\\P(W > w_{\\alpha}) = \\alpha$. Obtaining these values is straightforward (see the above callout tip). For $q = 2$ and a $\\alpha = 0.05$, the critical value is roughly 6. \n\n\nThe Wald statistic is not a common test provided by standard statistical software functions like `lm()` in R, though it is fairly straightforward to implement \"by hand.\" Alternatively, packages like [`{aod}`](https://cran.r-project.org/web/packages/aod/index.html) or [`{clubSandwich}`](http://jepusto.github.io/clubSandwich/) have implementations of the test. What is reported by most software implementations of OLS (like `lm()` in R) is the F-statistic, which is\n$$ \nF = \\frac{W}{q},\n$$\nwhich also typically uses the the homoskedastic variance estimator $\\mb{V}^{\\texttt{lm}}_{\\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution is not really statistically justified, it is slightly more conservative than the $\\chi^2_q$ distribution, and the inference will converge as $n\\to\\infty$. So it might be justified as an *ad hoc* small sample adjustment to the Wald test. For example, if we used the $F_{q,n-k-1}$ with the interaction example where $q=2$ and say we have a sample size of $n = 100$. 
In that case, the critical value for the F test with $\\alpha = 0.05$ is\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqf(0.95, df1 = 2, df2 = 100 - 4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.091191\n```\n:::\n:::\n\n\n\nThis result implies a critical value of 6.182 on the scale of the Wald statistic (multiplying it by $q = 2$). Compared to the earlier critical value of 5.991 based on the $\\chi^2_2$ distribution, we can see that the inferences will be very similar even in moderately-sized datasets. \n\nFinally, note that the F-statistic reported by `lm()` in R is the test of all the coefficients except the intercept being 0. In modern quantitative social sciences, this test is seldom substantively interesting. \n\n\n## Finite-sample properties with a linear CEF\n\nAll the above results have been large-sample properties, and we have not addressed finite-sample properties like the sampling variance or unbiasedness. Under the linear projection assumption above, OLS is generally biased without stronger assumptions. This section introduces the stronger assumption that will allow us to establish stronger properties for OLS. As usual, however, remember that these stronger assumptions can be wrong. \n\n\n::: {.callout-note}\n## Assumption: Linear Regression Model\n1. The variables $(Y_{i}, \\X_{i})$ satisfy the linear CEF assumption.\n$$ \n\\begin{aligned}\n Y_{i} &= \\X_{i}'\\bfbeta + e_{i} \\\\\n \\E[e_{i}\\mid \\X_{i}] & = 0.\n\\end{aligned}\n$$\n\n2. The design matrix is invertible $\\E[\\X_{i}\\X_{i}'] > 0$ (positive definite).\n:::\n\n\nWe discussed the concept of a linear CEF extensively in @sec-regression. However, recall that the CEF might be linear mechanically if the model is **saturated** or when there are as many coefficients in the model as there are unique values of $\\X_i$. When a model is not saturated, the linear CEF assumption is just that: an assumption. What can this assumption do? It can actually establish quite a few nice statistical properties in finite samples. \n\nOne note before we proceed. When focusing on the finite sample inference for OLS, it is customary to focus on its properties **conditional on the observed covariates**, such as $\\E[\\bhat \\mid \\Xmat]$ or $\\V[\\bhat \\mid \\Xmat]$. The historical reason for this was that the researcher often chose these independent variables, so they were not random. Thus, you'll sometimes see $\\Xmat$ treated as \"fixed\" in some older texts, and they might even omit explicit conditioning statements. \n\n\n::: {#thm-ols-unbiased}\n\nUnder the linear regression model assumption, OLS is unbiased for the population regression coefficients, \n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta,\n$$\nand its conditional sampling variance issue\n$$\n\\mb{\\V}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nwhere $\\sigma^2_{i} = \\E[e_{i}^{2} \\mid \\Xmat]$. \n:::\n\n\n::: {.proof}\n\nTo prove the conditional unbiasedness, recall that we can write the OLS estimator as\n$$\n\\bhat = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e},\n$$\nand so taking (conditional) expectations, we have,\n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta + \\E[(\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\E[\\mb{e} \\mid \\Xmat] = \\bfbeta,\n$$\nbecause under the linear CEF assumption $\\E[\\mb{e}\\mid \\Xmat] = 0$. 
\n\nFor the conditional sampling variance, we can use the same decomposition we have,\n$$\n\\V[\\bhat \\mid \\Xmat] = \\V[\\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = (\\Xmat'\\Xmat)^{-1}\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat(\\Xmat'\\Xmat)^{-1}. \n$$\nSince $\\E[\\mb{e}\\mid \\Xmat] = 0$, we know that $\\V[\\mb{e}\\mid \\Xmat] = \\E[\\mb{ee}' \\mid \\Xmat]$, which is a matrix with diagonal entries $\\E[e_{i}^{2} \\mid \\Xmat] = \\sigma^2_i$ and off-diagonal entries $\\E[e_{i}e_{j} \\Xmat] = \\E[e_{i}\\mid \\Xmat]\\E[e_{j}\\mid\\Xmat] = 0$, where the first equality follows from the independence of the errors across units. Thus, $\\V[\\mb{e} \\mid \\Xmat]$ is a diagonal matrix with $\\sigma^2_i$ along the diagonal, which means\n$$\n\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat = \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i',\n$$\nestablishing the conditional sampling variance.\n \n:::\n\nThus, for any realization of the covariates, $\\Xmat$, OLS is unbiased for the true regression coefficients $\\bfbeta$. By the law of iterated expectation, we also know that it is unconditionally unbiased[^unconditional] as well since\n$$\n\\E[\\bhat] = \\E[\\E[\\bhat \\mid \\Xmat]] = \\bfbeta. \n$$\nThe difference between these two statements usually isn't incredibly meaningful. \n\n[^unconditional]: We are basically ignoring some edge cases when it comes to discrete covariates here. In particular, we assume that $\\Xmat'\\Xmat$ is nonsingular with probability one. However, this can fail if we have a binary covariate since there is some chance (however slight) that the entire column will be all ones or all zeros, which would lead to a singular matrix $\\Xmat'\\Xmat$. Practically this is not a big deal, but it does mean that we have to ignore this issue theoretically or focus on conditional unbiasedness. \n\n\nThere are a lot of variances flying around, so it's helpful to review them. Above, we derived the asymptotic variance of $\\mb{Z}_{n} = \\sqrt{n}(\\bhat - \\bfbeta)$, \n$$\n\\mb{V}_{\\bfbeta} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1},\n$$\nwhich implies that the approximate variance of $\\bhat$ will be $\\mb{V}_{\\bfbeta} / n$ because\n$$\n\\bhat = \\frac{Z_n}{\\sqrt{n}} + \\bfbeta \\quad\\implies\\quad \\bhat \\overset{a}{\\sim} \\N(\\bfbeta, n^{-1}\\mb{V}_{\\bfbeta}),\n$$\nwhere $\\overset{a}{\\sim}$ means approximately asymptotically distributed as. Under the linear CEF, the conditional sampling variance of $\\bhat$ has a similar form and will be similar to the \n$$\n\\mb{V}_{\\bhat} = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\approx \\mb{V}_{\\bfbeta} / n\n$$\nIn practice, these two derivations lead to basically the same variance estimator. Recall the heteroskedastic-consistent variance estimator is\n$$\n\\widehat{\\mb{V}}_{\\bfbeta} = \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n$$\nis a valid plug-in estimator for the asymptotic variance and\n$$\n\\widehat{\\mb{V}}_{\\bhat} = n^{-1}\\widehat{\\mb{V}}_{\\bfbeta}.\n$$\nThus, in practice, the asymptotic and finite-sample results under a linear CEF justify the same variance estimator. \n\n\n### Linear CEF model under homoskedasticity\n\nIf we are willing to make a homoskedasticity assumption on the errors, we can derive even stronger results for OLS. 
Stronger assumptions typically lead to stronger conclusions, but those conclusions may not be robust to assumption violations. But homoskedasticity is such a historically important assumption that statistical software implementations of OLS like `lm()` in R assume it. \n\n::: {.callout-note}\n\n## Assumption: Homoskedasticity with a linear CEF\n\nIn addition to the linear CEF assumption, we further assume that\n$$\n\\E[e_i^2 \\mid \\X_i] = \\E[e_i^2] = \\sigma^2,\n$$\nor that variance of the errors does not depend on the covariates. \n:::\n\n\n::: {#thm-homoskedasticity}\n\nUnder a linear CEF model with homoskedastic errors, the conditional sampling variance is\n$$\n\\mb{V}^{\\texttt{lm}}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\sigma^2 \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nand the variance estimator \n$$\n\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} = \\widehat{\\sigma}^2 \\left( \\Xmat'\\Xmat \\right)^{-1} \\quad\\text{where,}\\quad \\widehat{\\sigma}^2 = \\frac{1}{n - k - 1} \\sum_{i=1}^n \\widehat{e}_i^2\n$$\nis unbiased, $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. \n::: \n\n::: {.proof}\nUnder homoskedasticity $\\sigma^2_i = \\sigma^2$ for all $i$. Recall that $\\sum_{i=1}^n \\X_i\\X_i' = \\Xmat'\\Xmat$ Thus, the conditional sampling variance from @thm-ols-unbiased, \n$$ \n\\begin{aligned}\n\\V[\\bhat \\mid \\Xmat] &= \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2 \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\ &= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\Xmat'\\Xmat \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1} = \\mb{V}^{\\texttt{lm}}_{\\bhat}.\n\\end{aligned}\n$$\n\nFor unbiasedness, we just need to show that $\\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] = \\sigma^2$. Recall that we defined $\\mb{M}_{\\Xmat}$ as the residual-maker because $\\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}}$. We can use this to connect the residuals to the errors,\n$$ \n\\mb{M}_{\\Xmat}\\mb{e} = \\mb{M}_{\\Xmat}\\mb{Y} - \\mb{M}_{\\Xmat}\\Xmat\\bfbeta = \\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}},\n$$ \nso \n$$\n\\V[\\widehat{\\mb{e}} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\V[\\mb{e} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\sigma^2,\n$$\nwhere the first equality is because $\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'$ is constant conditional on $\\Xmat$. Notice that the diagonal entries of this matrix are the variances of particular residuals $\\widehat{e}_i$ and that the diagonal entries of the annihilator matrix are $1 - h_{ii}$ (since the $h_{ii}$ are the diagonal entries of $\\mb{P}_{\\Xmat}$). Thus, we have\n$$ \n\\V[\\widehat{e}_i \\mid \\Xmat] = \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] = (1 - h_{ii})\\sigma^{2}.\n$$\nIn the last chapter, we established one property of these leverage values in @sec-leverage is that $\\sum_{i=1}^n h_{ii} = k+ 1$, so $\\sum_{i=1}^n 1- h_{ii} = n - k - 1$ and we have\n$$ \n\\begin{aligned}\n \\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] &= \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] \\\\\n &= \\frac{\\sigma^{2}}{n-k-1} \\sum_{i=1}^{n} 1 - h_{ii} \\\\\n &= \\sigma^{2}\n\\end{aligned}\n$$\nThis establishes $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. 
\n\n:::\n\n\nThus, under the linear CEF model and homoskedasticity of the errors, we have an unbiased variance estimator that is a simple function of the sum of squared residuals and the design matrix. Most statistical software packages estimate standard errors using $\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}$. \n\n\nThe final result we can derive for the linear CEF under homoskedasticity is an optimality result. We might ask ourselves if there is another estimator for $\\bfbeta$ that would outperform OLS in the sense of having a lower sampling variance. Perhaps surprisingly, no linear estimator for $\\bfbeta$ has a lower conditional variance, meaning that OLS is the **best linear unbiased estimator**, often jovially shortened to BLUE. This result is famously known as the Gauss-Markov Theorem.\n\n::: {#thm-gauss-markov}\n\nLet $\\widetilde{\\bfbeta} = \\mb{AY}$ be a linear and unbiased estimator for $\\bfbeta$. Under the linear CEF model with homoskedastic errors, \n$$\n\\V[\\widetilde{\\bfbeta}\\mid \\Xmat] \\geq \\V[\\bhat \\mid \\Xmat]. \n$$\n\n:::\n\n::: {.proof}\nNote that if $\\widetilde{\\bfbeta}$ is unbiased then $\\E[\\widetilde{\\bfbeta} \\mid \\Xmat] = \\bfbeta$ and so \n$$\n\\bfbeta = \\E[\\mb{AY} \\mid \\Xmat] = \\mb{A}\\E[\\mb{Y} \\mid \\Xmat] = \\mb{A}\\Xmat\\bfbeta,\n$$\nwhich implies that $\\mb{A}\\Xmat = \\mb{I}_n$. \nRewrite the competitor as $\\widetilde{\\bfbeta} = \\bhat + \\mb{BY}$ where,\n$$ \n\\mb{B} = \\mb{A} - \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'.\n$$\nand note that $\\mb{A}\\Xmat = \\mb{I}_n$ implies that $\\mb{B}\\Xmat = 0$. We now have\n$$ \n\\begin{aligned}\n \\widetilde{\\bfbeta} &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{Y} \\\\\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\mb{B}\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e}\n\\end{aligned}\n$$\nThe variance of the competitor is, thus, \n$$ \n\\begin{aligned}\n \\V[\\widetilde{\\bfbeta} \\mid \\Xmat]\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\V[\\mb{e}\\mid \\Xmat]\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)' \\\\\n &= \\sigma^{2}\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\left( \\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{B}'\\right) \\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\mb{B}' + \\mb{B}\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &\\geq \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1} \\\\\n &= \\V[\\bhat \\mid \\Xmat]\n\\end{aligned}\n$$\nThe first equality comes from the properties of covariance matrices; the second is due to homoskedasticity; the fourth is due to $\\mb{B}\\Xmat = 0$, which implies that $\\Xmat'\\mb{B}' = 0$ as well. The fifth inequality holds because matrix products of the form $\\mb{BB}'$ are positive definite if $\\mb{B}$ is of full rank (which we have assumed it is). 
\n\n:::\n\nIn this proof, we saw that the variance of the competing estimator had variance $\\sigma^2\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)$ which we argued was \"greater than 0\" in the matrix sense, which is also called positive definite. What does this mean practically? Remember that any positive definite matrix must have strictly positive diagonal entries and that the diagonal entries of $\\V[\\bhat \\mid \\Xmat]$ and $V[\\widetilde{\\bfbeta}\\mid \\Xmat]$ are the variances of the individual parameters, $\\V[\\widehat{\\beta}_{j} \\mid \\Xmat]$ and $\\V[\\widetilde{\\beta}_{j} \\mid \\Xmat]$. Thus, the variances of the individual parameters will be larger for $\\widetilde{\\bfbeta}$ than for $\\bhat$.\n\nMany textbooks cite the Gauss-Markov theorem as a critical advantage of OLS over other methods, but it's essential to recognize its limitations. It requires linearity and homoskedastic error assumptions, which can be false in many applications. \n\nFinally, note that while we have shown this result for linear estimators, @Hansen22 proves a more general version of this result that applies to any unbiased estimator. \n\n## The normal linear model\n\nFinally, we add the strongest and thus least loved of the classical linear regression assumption: (conditional) normality of the errors. The historical reason to use this assumption was that finite-sample inference hits a roadblock without some knowledge of the sampling distribution of $\\bhat$. Under the linear CEF model, we saw that it was unbiased, and under homoskedasticity, we could produce an unbiased estimator of the conditional variance. But to do hypothesis testing or generate confidence intervals, we need to be able to make probability statements about the estimator, and for that, we need to know its exact distribution. When the sample size is large, we can rely on the CLT and know it is approximately normal. But in small samples, what do we do? Historically, we decided to assume (conditional) normality of the errors to proceed with some knowledge that we were wrong but hopefully not too wrong. \n\n\n::: {.callout-note}\n\n## The normal linear regression model\n\nIn addition to the linear CEF assumption, we assume that \n$$\ne_i \\mid \\Xmat \\sim \\N(0, \\sigma^2).\n$$\n\n:::\n\nA couple of things to point out: \n\n- The assumption here is not that $(Y_{i}, \\X_{i})$ are jointly normal (though this would be sufficient for the assumption to hold), but rather that $Y_i$ is normally distributed conditional on $\\X_i$. \n- Notice that the normal regression model has the homoskedasticity assumption baked in. \n\n::: {#thm-normal-ols}\n\nUnder the normal linear regression model, we have\n$$ \n\\begin{aligned}\n \\bhat \\mid \\Xmat &\\sim \\N\\left(\\bfbeta, \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1}\\right) \\\\\n \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}]_{jj}/\\sqrt{n}} &\\sim t_{n-k-1} \\\\\n W/q &\\sim F_{q, n-k-1}. \n\\end{aligned}\n$$\n\n:::\n\n\nThis theorem says that in the normal linear regression model, the coefficients follow a normal distribution, the t-statistics follow a $t$-distribution, and a transformation of the Wald statistic follows an $F$ distribution. These are **exact** results and do not rely on large-sample approximations. Under the assumption of conditional normality of the errors, they are as valid for $n = 5$ as for $n = 500,000$. \n\nFew people believe errors follow a normal distribution, so why even present these results? 
Unfortunately, most statistical software implementations of OLS implicitly assume this when calculating p-values for tests or constructing confidence intervals. That is, the p-value associated with the $t$-statistic that `lm()` outputs in R relies on the $t_{n-k-1}$ distribution, and the critical values used to construct confidence intervals with `confint()` use that distribution as well. When normality does not hold, there is no principled reason to use the $t$ or the $F$ distributions in this way. But we might hold our nose and use this *ad hoc* procedure under two rationalizations:\n\n- $\\bhat$ is asymptotically normal, but this approximation might be poor in smaller finite samples. The $t$ distribution will make inference more conservative in these cases (wider confidence intervals, smaller test rejection regions), which might help offset the poor approximation of the normal in small samples. \n- As $n\\to\\infty$, the $t_{n-k-1}$ will converge to a standard normal distribution, so the *ad hoc* adjustment will not matter much for medium to large samples. \n\nThese arguments are not very convincing since it's unclear whether the $t$ approximation will be any better than the normal in finite samples. But it's the best we can do to console ourselves as we find more data. \n", + "markdown": "# The statistics of least squares\n\nIn the last chapter, we derived the least squares estimator and investigated many of its mechanical properties. These properties are essential for the practical application of OLS. Still, we should also understand its statistical properties, such as the ones described in Part I: unbiasedness, sampling variance, consistency, and asymptotic normality. As we saw then, these properties fall into finite-sample (unbiasedness, sampling variance) and asymptotic (consistency, asymptotic normality). \n\nIn this chapter, we will focus first on the asymptotic properties of OLS because those properties hold under the relatively mild conditions of the linear projection model introduced in @sec-linear-projection. We will see that OLS consistently estimates a coherent quantity of interest (the best linear predictor) regardless of whether the conditional expectation is linear. That is, for the asymptotic properties of the estimator, we will not need the commonly invoked linearity assumption. Later, when we investigate the finite-sample properties, we will show how linearity will help us establish unbiasedness and how normality of the errors can allow us to conduct exact, finite-sample inference. But these assumptions are very strong, so it's vital to understand what we can say about OLS without making them. \n\n## Large-sample properties of OLS\n\nAs we saw in @sec-asymptotics, we need two key ingredients to conduct statistical inference with the OLS estimator: a consistent estimate of the variance of $\\bhat$ and the approximate distribution of $\\bhat$ in large samples. Remember that since $\\bhat$ is a vector, the variance of that estimator will actually be a variance-covariance matrix. To obtain these two ingredients, we will first establish the consistency of OLS and then use the central limit theorem to derive its asymptotic distribution, which will include its variance. \n\n\nWe begin by setting out the assumptions we will need for establishing the large-sample properties of OLS, which are the same as the assumptions needed to ensure that the best linear predictor, $\\bhat = \\E[\\X_{i}\\X_{i}']^{-1}\\E[\\X_{i}Y_{i}]$, is well-defined and unique. 
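To make the plug-in form of the estimator concrete, here is a minimal R sketch (simulated data; the data-generating process and object names are ours) that computes the sample analogue $(\Xmat'\Xmat)^{-1}\Xmat'\mb{Y}$ directly and confirms it matches `lm()`, even though the simulated CEF is deliberately nonlinear:

```r
set.seed(2138)
n <- 5000

## simulated data with a deliberately nonlinear CEF: E[Y | X = x] = exp(x)
x <- rnorm(n)
y <- exp(x) + rnorm(n)

## design matrix with an intercept column
X <- cbind(`(Intercept)` = 1, x = x)

## sample analogue of the BLP coefficients: solve (X'X) b = X'Y
b_blp <- solve(crossprod(X), crossprod(X, y))

## lm() produces the same coefficients
cbind(manual = drop(b_blp), lm = coef(lm(y ~ x)))
```

Because the simulated CEF is exponential, neither column recovers the CEF itself; both recover the best linear approximation to it, which is exactly the target that the consistency result below refers to.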
\n\n::: {.callout-note}\n\n### Linear projection assumptions\n\nThe linear projection model makes the following assumptions:\n\n1. $\\{(Y_{i}, \\X_{i})\\}_{i=1}^n$ are iid random vectors\n\n2. $\\E[Y^{2}_{i}] < \\infty$ (finite outcome variance)\n\n3. $\\E[\\Vert \\X_{i}\\Vert^{2}] < \\infty$ (finite variances and covariances of covariates)\n\n2. $\\E[\\X_{i}\\X_{i}']$ is positive definite (no linear dependence in the covariates)\n:::\n\n\nRecall that these are mild conditions on the joint distribution of $(Y_{i}, \\X_{i})$ and in particular, we are **not** assuming linearity of the CEF, $\\E[Y_{i} \\mid \\X_{i}]$, nor are we assuming any specific distribution for the data. \n\nWe can helpfully decompose the OLS estimator into the actual BLP coefficient plus estimation error as\n$$ \n\\bhat = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_iY_i \\right) = \\bfbeta + \\underbrace{\\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\right)}_{\\text{estimation error}}.\n$$ \n \nThis decomposition will help us quickly establish the consistency of $\\bhat$. By the law of large numbers, we know that sample means will converge in probability to population expectations, so we have\n$$ \n\\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\inprob \\E[\\X_i\\X_i'] \\equiv \\mb{Q}_{\\X\\X} \\qquad \\frac{1}{n} \\sum_{i=1}^n \\X_ie_i \\inprob \\E[\\X_{i} e_{i}] = \\mb{0},\n$$\nwhich implies that \n$$\n\\bhat \\inprob \\bfbeta + \\mb{Q}_{\\X\\X}^{-1}\\E[\\X_ie_i] = \\bfbeta,\n$$\nby the continuous mapping theorem (the inverse is a continuous function). The linear projection assumptions ensure that LLN applies to these sample means and ensure that $\\E[\\X_{i}\\X_{i}']$ is invertible. \n\n\n::: {#thm-ols-consistency}\nUnder the above linear projection assumptions, the OLS estimator is consistent for the best linear projection coefficients, $\\bhat \\inprob \\bfbeta$.\n:::\n\nThus, OLS should be close to the population linear regression in large samples under relatively mild conditions. Remember that this might not equal the conditional expectation if the CEF is nonlinear. We can say here that OLS converges to the best *linear* approximation to the CEF. Of course, this also means that if the CEF is linear, then OLS will consistently estimate the coefficients of the CEF. \n\nTo emphasize here: the only assumption we made about the dependent variable is that it has finite variance and is iid. Under this assumption, the outcome could be continuous, categorical, binary, or event count. \n\n\nNext, we would like to establish an asymptotic normality result for the OLS coefficients. We first review some key ideas about the central limit theorem.\n\n::: {.callout-note}\n\n## CLT reminder\n\nSuppose that we have a function of the data iid random vectors $\\X_1, \\ldots, \\X_n$, $g(\\X_{i})$ where $\\E[g(\\X_{i})] = 0$ and so $\\V[g(\\X_{i})] = \\E[g(\\X_{i})g(\\X_{i})']$. 
Then if $\\E[\\Vert g(\\X_{i})\\Vert^{2}] < \\infty$, the CLT implies that\n$$ \n\\sqrt{n}\\left(\\frac{1}{n} \\sum_{i=1}^{n} g(\\X_{i}) - \\E[g(\\X_{i})]\\right) = \\frac{1}{\\sqrt{n}} \\sum_{i=1}^{n} g(\\X_{i}) \\indist \\N(0, \\E[g(\\X_{i})g(\\X_{i}')]) \n$$ {#eq-clt-mean-zero}\n:::\n\nWe now manipulate our decomposition to arrive at the *stabilized* version of the estimator,\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) = \\left( \\frac{1}{n} \\sum_{i=1}^n \\X_i\\X_i' \\right)^{-1} \\left( \\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\right).\n$$\nWe have already established that the first term on the right-hand side will converge in probability to $\\mb{Q}_{\\X\\X}^{-1}$. Notice that $\\E[\\X_{i}e_{i}] = 0$, so we can apply @eq-clt-mean-zero to the second term. The covariance matrix of $\\X_ie_{i}$ is \n$$ \n\\mb{\\Omega} = \\V[\\X_{i}e_{i}] = \\E[\\X_{i}e_{i}(\\X_{i}e_{i})'] = \\E[e_{i}^{2}\\X_{i}\\X_{i}'].\n$$ \nThe CLT will imply that\n$$ \n\\frac{1}{\\sqrt{n}} \\sum_{i=1}^n \\X_ie_i \\indist \\N(0, \\mb{\\Omega}).\n$$\nCombining these facts with Slutsky's Theorem implies the following theorem. \n\n::: {#thm-ols-asymptotic-normality}\n\nSuppose that the linear projection assumptions hold and, in addition, we have $\\E[Y_{i}^{4}] < \\infty$ and $\\E[\\lVert\\X_{i}\\rVert^{4}] < \\infty$. Then the OLS estimator is asymptotically normal with\n$$ \n\\sqrt{n}\\left( \\bhat - \\bfbeta\\right) \\indist \\N(0, \\mb{V}_{\\bfbeta}),\n$$\nwhere\n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1}.\n$$\n\n:::\n\nThus, if the sample size is large enough, we can approximate the distribution of $\\bhat$ with a multivariate normal with mean $\\bfbeta$ and covariance matrix $\\mb{V}_{\\bfbeta}/n$. In particular, the square root of the $j$th diagonals of this matrix will be standard errors for $\\widehat{\\beta}_j$. Knowing the shape of the OLS estimator's multivariate distribution will allow us to conduct hypothesis tests and generate confidence intervals for both individual coefficients and groups of coefficients. But first, we need an estimate of the covariance matrix!\n\n\n\n## Variance estimation for OLS\n\nThe asymptotic normality of OLS from the last section is of limited value without some way to estimate the covariance matrix, \n$$ \n\\mb{V}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\mb{\\Omega}\\mb{Q}_{\\X\\X}^{-1}.\n$$\nSince each term here is a population mean, this is an ideal place to drop a plug-in estimator. In particular, let's use the following estimators:\n$$ \n\\begin{aligned}\n \\mb{Q}_{\\X\\X} &= \\E[\\X_{i}\\X_{i}'] & \\widehat{\\mb{Q}}_{\\X\\X} &= \\frac{1}{n} \\sum_{i=1}^{n} \\X_{i}\\X_{i}' = \\frac{1}{n}\\Xmat'\\Xmat \\\\\n \\mb{\\Omega} &= \\E[e_i^2\\X_i\\X_i'] & \\widehat{\\mb{\\Omega}} & = \\frac{1}{n}\\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i'.\n\\end{aligned}\n$$\nUnder the assumptions of @thm-ols-asymptotic-normality, the LLN will imply that these are consistent for their targets, $\\widehat{\\mb{Q}}_{\\X\\X} \\inprob \\mb{Q}_{\\X\\X}$ and $\\widehat{\\mb{\\Omega}} \\inprob \\mb{\\Omega}$. 
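For concreteness, here is a short sketch of these two plug-in pieces in R (simulated, heteroskedastic data; the design and object names are ours, not from the text):

```r
## a concrete look at the two plug-in pieces on simulated, heteroskedastic data
set.seed(60637)
n <- 1000
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
y  <- 1 + 0.5 * x1 - 0.25 * x2 + rnorm(n, sd = 1 + x2)  # error sd depends on x2

fit   <- lm(y ~ x1 + x2)
X     <- model.matrix(fit)   # n x (k + 1) design matrix
e_hat <- resid(fit)

Q_hat     <- crossprod(X) / n           # (1/n) sum of x_i x_i'
Omega_hat <- crossprod(X * e_hat) / n   # (1/n) sum of e-hat_i^2 x_i x_i'
```

These two matrices are exactly the pieces that get combined in the sandwich formula in the next display.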
We can plug these into the variance formula to arrive at\n$$ \n\\begin{aligned}\n \\widehat{\\mb{V}}_{\\bfbeta} &= \\widehat{\\mb{Q}}_{\\X\\X}^{-1}\\widehat{\\mb{\\Omega}}\\widehat{\\mb{Q}}_{\\X\\X}^{-1} \\\\\n &= \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n\\end{aligned}\n$$\nwhich by the continuous mapping theorem is consistent, $\\widehat{\\mb{V}}_{\\bfbeta} \\inprob \\mb{V}_{\\bfbeta}$. \n\nThis estimator is sometimes called the **robust variance estimator** or, more accurately, the **heteroskedasticity-consistent (HC) variance estimator**. How is this robust? Consider the standard **homoskedasticity** assumption that most statistical software packages make when estimating OLS variances: the variance of the errors does not depend on the covariates: $\\V[e_{i}^{2} \\mid \\X_{i}] = \\V[e_{i}^{2}]$. This assumption is stronger than we need, and we can rely on a weaker assumption that the squared errors are uncorrelated with a specific function of the covariates: \n$$ \n\\E[e_{i}^{2}\\X_{i}\\X_{i}'] = \\E[e_{i}^{2}]\\E[\\X_{i}\\X_{i}'] = \\sigma^{2}\\mb{Q}_{\\X\\X}, \n$$\nwhere $\\sigma^2$ is the variance of the residuals (since $\\E[e_{i}] = 0$). Homoskedasticity simplifies the asymptotic variance of the stabilized estimator, $\\sqrt{n}(\\bhat - \\bfbeta)$, to\n$$ \n\\mb{V}^{\\texttt{lm}}_{\\bfbeta} = \\mb{Q}_{\\X\\X}^{-1}\\sigma^{2}\\mb{Q}_{\\X\\X}\\mb{Q}_{\\X\\X}^{-1} = \\sigma^2\\mb{Q}_{\\X\\X}^{-1}.\n$$\nWe already have an estimator for $\\mb{Q}_{\\X\\X}$, but we need one for $\\sigma^2$. We can easily use the SSR,\n$$ \n\\widehat{\\sigma}^{2} = \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\widehat{e}_{i}^{2},\n$$\nwhere we use $n-k-1$ in the denominator instead of $n$ to correct for the residuals being slightly less variable than the actual errors (because OLS mechanically attempts to make the residuals small). For consistent variance estimation, $n-k -1$ or $n$ can be used, since either way $\\widehat{\\sigma}^2 \\inprob \\sigma^2$. Thus, under homoskedasticity, we have\n$$ \n\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}} = \\widehat{\\sigma}^{2}\\left(\\Xmat'\\Xmat\\right)^{{-1}},\n$$\nwhich is the standard variance estimator used by `lm()` in R or `reg` in Stata. \n\n\nNow that we have two estimators, $\\widehat{\\mb{V}}_{\\bfbeta}$ and $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$, how do they compare? Notice that the HC variance estimator and the homoskedasticity variance estimator will both be consistent when homoskedasticity holds. But as the \"heteroskedasticity-consistent\" label implies, only the HC variance estimator will be consistent when homoskedasticity fails to hold. So $\\widehat{\\mb{V}}_{\\bfbeta}$ has the advantage of being consistent regardless of this assumption. This advantage comes at a cost, however. When homoskedasticity is correct, $\\widehat{\\mb{V}}_{\\bfbeta}^{\\texttt{lm}}$ incorporates that assumption into the estimator where the HC variance estimator has to estimate it. The HC estimator will have higher variance (the variance estimator will be more variable!) when homoskedasticity actually does hold. \n\n\n\n\n\nNow that we have established the asymptotic normality of the OLS estimator and developed a consistent estimator of its variance, we can proceed with all of the statistical inference tools we discussed in Part I of this guide. 
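To see how the two estimators compare in practice, here is a minimal sketch (same simulated, heteroskedastic design as above; it assumes the {sandwich} package is installed, and all object names are ours). It builds the sandwich by hand, which equals $\widehat{\mb{V}}_{\bfbeta}/n$, the estimated variance of $\bhat$ itself, and compares it to the classical estimator and to `sandwich::vcovHC()`:

```r
## build the HC variance estimator by hand and compare it to the classical one
set.seed(60637)
n <- 1000
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
y  <- 1 + 0.5 * x1 - 0.25 * x2 + rnorm(n, sd = 1 + x2)

fit   <- lm(y ~ x1 + x2)
X     <- model.matrix(fit)
e_hat <- resid(fit)

xtx_inv <- solve(crossprod(X))          # (X'X)^{-1}
meat    <- crossprod(X * e_hat)         # sum of e-hat_i^2 x_i x_i'
V_hc0   <- xtx_inv %*% meat %*% xtx_inv # estimated variance of beta-hat (= V-hat_beta / n)

V_lm  <- vcov(fit)                            # classical sigma^2-hat (X'X)^{-1}
V_pkg <- sandwich::vcovHC(fit, type = "HC0")  # the same sandwich, from the package

all.equal(V_hc0, V_pkg, check.attributes = FALSE)  # should be TRUE: identical formula

## robust vs. classical standard errors
cbind(robust = sqrt(diag(V_hc0)), classical = sqrt(diag(V_lm)))
```

Because the simulated errors are heteroskedastic, the robust and classical standard errors will generally disagree here; with homoskedastic errors the two columns would agree up to sampling noise.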
Define the estimated **heteroskedasticity-consistent standard errors** as\n$$ \n\\widehat{\\se}(\\widehat{\\beta}_{j}) = \\sqrt{\\frac{[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}}{n}},\n$$\nwhere $[\\widehat{\\mb{V}}_{\\bfbeta}]_{jj}$ is the $j$th diagonal entry of the HC variance estimator. Note that we divide by $\\sqrt{n}$ here because $\\widehat{\\mb{V}}_{\\bfbeta}$ is a consistent estimator of the asymptotic variance of the stabilized estimator $\\sqrt{n}(\\bhat - \\bfbeta)$, not of the variance of $\\bhat$ itself. \n\nHypothesis tests and confidence intervals for individual coefficients are almost precisely the same as in the general case presented in Part I. For a two-sided test of $H_0: \\beta_j = b$ versus $H_1: \\beta_j \\neq b$, we can build the t-statistic and conclude that, under the null,\n$$\n\\frac{\\widehat{\\beta}_j - b}{\\widehat{\\se}(\\widehat{\\beta}_{j})} \\indist \\N(0, 1).\n$$\nTypically, statistical software will helpfully provide the t-statistic for the null of no (partial) linear relationship between $X_{ij}$ and $Y_i$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nwhich measures how large the estimated coefficient is in standard errors. With $\\alpha = 0.05$, asymptotic normality would imply that we reject this null when $|t| > 1.96$. We can form asymptotically valid confidence intervals with \n$$ \n\\left[\\widehat{\\beta}_{j} - z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j}),\\;\\widehat{\\beta}_{j} + z_{\\alpha/2}\\;\\widehat{\\se}(\\widehat{\\beta}_{j})\\right]. \n$$\nFor reasons we will discuss below, standard software typically relies on the $t$ distribution instead of the normal for hypothesis testing and confidence intervals. Still, this difference is of little consequence in large samples. 
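\n\nAs a quick illustration of these calculations, the sketch below computes robust standard errors, t-statistics, and 95% confidence intervals directly from a fitted `lm` object. The data are again simulated for illustration, and the robust covariance matrix is taken from the `{sandwich}` package:\n\n::: {.cell}\n\n```{.r .cell-code}\n## simulated data for illustration only\nset.seed(1234)\nn <- 500\nx <- rnorm(n)\ny <- 1 + 0.5 * x + rnorm(n, sd = 1 + abs(x))\nfit <- lm(y ~ x)\n\nV_bhat <- sandwich::vcovHC(fit, type = \"HC0\")  # estimated variance of beta-hat\nse_rob <- sqrt(diag(V_bhat))                   # robust standard errors\nt_stat <- coef(fit) / se_rob                   # t-statistics for H0: beta_j = 0\nci_lo <- coef(fit) - qnorm(0.975) * se_rob     # asymptotic 95% confidence interval\nci_hi <- coef(fit) + qnorm(0.975) * se_rob\ncbind(estimate = coef(fit), se_rob, t_stat, ci_lo, ci_hi)\n```\n:::\n\n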
\n\n## Inference for multiple parameters\n\nWith multiple coefficients, we might have hypotheses that involve more than one coefficient. As an example, let's focus on a regression with an interaction between two covariates, \n$$\nY_i = \\beta_0 + X_i\\beta_1 + Z_i\\beta_2 + X_iZ_i\\beta_3 + e_i.\n$$\nSuppose we wanted to test the hypothesis that $X_i$ does not affect the best linear predictor for $Y_i$. That would be\n$$ \nH_{0}: \\beta_{1} = 0 \\text{ and } \\beta_{3} = 0\\quad\\text{vs}\\quad H_{1}: \\beta_{1} \\neq 0 \\text{ or } \\beta_{3} \\neq 0,\n$$\nwhere we usually write the null more compactly as $H_0: \\beta_1 = \\beta_3 = 0$. \n\nTo test this null hypothesis, we need a test statistic that discriminates between the two hypotheses: it should be large when the alternative is true and small when the null is true. With a single coefficient, we usually test the null hypothesis of $H_0: \\beta_j = b_0$ with the $t$-statistic, \n$$ \nt = \\frac{\\widehat{\\beta}_{j} - b_{0}}{\\widehat{\\se}(\\widehat{\\beta}_{j})},\n$$\nand we usually take the absolute value, $|t|$, as our measure of how far our estimate is from the null. But notice that we could also use the square of the $t$ statistic, which is\n$$ \nt^{2} = \\frac{\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{\\V[\\widehat{\\beta}_{j}]} = \\frac{n\\left(\\widehat{\\beta}_{j} - b_{0}\\right)^{2}}{[\\mb{V}_{\\bfbeta}]_{jj}} \n$$ {#eq-squared-t}\n\nSo here's another way to differentiate the null from the alternative: the squared distance between them divided by the variance of the estimate. \n\nCan we generalize this idea to hypotheses about multiple parameters? Adding the sum of squared distances for each component of the null hypothesis is straightforward. For our interaction example, that would be\n$$ \n\\widehat{\\beta}_1^2 + \\widehat{\\beta}_3^2, \n$$\nbut remember that some of the estimated coefficients are noisier than others, so we should account for the uncertainty, just like we did for the $t$-statistic. \n\nWith hypotheses about multiple coefficients, keeping track of the variances will require matrix algebra. We can write any hypothesis about linear functions of the coefficients as $H_{0}: \\mb{L}\\bfbeta = \\mb{c}$. For example, in the interaction case, we have\n$$ \n\\mb{L} =\n\\begin{pmatrix}\n 0 & 1 & 0 & 0 \\\\\n 0 & 0 & 0 & 1 \\\\\n\\end{pmatrix}\n\\qquad\n\\mb{c} =\n\\begin{pmatrix}\n 0 \\\\\n 0\n\\end{pmatrix}\n$$\nThus, $\\mb{L}\\bfbeta = \\mb{0}$ is equivalent to $\\beta_1 = 0$ and $\\beta_3 = 0$. Notice that with other $\\mb{L}$ matrices, we could represent more complicated hypotheses like $2\\beta_1 - \\beta_2 = 34$, though we mostly stick to simpler functions. Let $\\widehat{\\bs{\\theta}} = \\mb{L}\\bhat$ be the OLS estimate of the function of the coefficients. By the delta method (discussed in @sec-delta-method), we have\n$$ \n\\sqrt{n}\\left(\\mb{L}\\bhat - \\mb{L}\\bfbeta\\right) \\indist \\N(0, \\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}').\n$$\nWe can now generalize the squared $t$ statistic in @eq-squared-t. In particular, we will take the distances $\\mb{L}\\bhat - \\mb{c}$ weighted by the variance-covariance matrix $\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}'$, \n$$ \nW = n(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}')^{-1}(\\mb{L}\\bhat - \\mb{c}),\n$$\nwhich is called the **Wald test statistic**. This statistic generalizes the ideas of the t-statistic to multiple parameters. With the t-statistic, we recenter to have mean 0 and divide by the standard error to get a variance of 1. If we ignore the middle variance weighting, we have $(\\mb{L}\\bhat - \\mb{c})'(\\mb{L}\\bhat - \\mb{c})$, which is just the sum of the squared deviations of the estimates from the null. Including the $(\\mb{L}\\mb{V}_{\\bfbeta}\\mb{L}')^{-1}$ weight has the effect of rescaling the distribution of $\\mb{L}\\bhat - \\mb{c}$ to make it rotationally symmetric around 0 (so the resulting dimensions are uncorrelated) with each dimension having an equal variance of 1. In this way, the Wald statistic transforms the random vectors to be mean-centered and have variance 1 (just like the t-statistic), but also makes the resulting random variables in the vector uncorrelated.[^norms]\n\n\n[^norms]: The form of the Wald statistic is that of a weighted inner product, $\\mb{x}'\\mb{A}\\mb{y}$, where $\\mb{A}$ is a symmetric positive-definite weighting matrix. \n\nWhy transform the data in this way? @fig-wald shows the contour plot of a hypothetical joint distribution of two coefficients from an OLS regression. We might want to know how far different points in the distribution are from the mean, which in this case is $(1, 2)$. Without considering the joint distribution, the circle is obviously closer to the mean than the triangle. However, looking at where the two points are on the distribution, the circle is at a lower contour than the triangle, meaning it is more extreme than the triangle for this particular distribution. The Wald statistic, then, takes into consideration how much of a \"climb\" it is for $\\mb{L}\\bhat$ to get to $\\mb{c}$ given the distribution of $\\mb{L}\\bhat$.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Hypothetical joint distribution of two slope coefficients. 
The circle is closer to the center of the distribution by the standard Euclidean distance, but the triangle is closer once you consider the joint distribution.](08_ols_properties_files/figure-pdf/fig-wald-1.pdf){#fig-wald}\n:::\n:::\n\n\n\n\nIf $\\mb{L}$ only has one row, our Wald statistic is the same as the squared $t$ statistic, $W = t^2$. This fact will help us think about the asymptotic distribution of $W$. Notice that as $n\\to\\infty$, we know that by the asymptotic normality of $\\bhat$,\n$$ \nt = \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{\\widehat{\\se}[\\widehat{\\beta}_{j}]} \\indist \\N(0,1)\n$$\nso $t^2$ will converge in distribution to a $\\chi^2_1$ (since a $\\chi^2_1$ is just one standard normal squared). After recentering and rescaling by the covariance matrix, $W$ converges in distribution to the sum of $q$ squared independent standard normals, where $q$ is the number of rows of $\\mb{L}$, or equivalently, the number of restrictions implied by the null hypothesis. Thus, under the null hypothesis of $\\mb{L}\\bfbeta = \\mb{c}$, we have $W \\indist \\chi^2_{q}$. \n\n\n::: {.callout-note}\n\n## Chi-squared critical values\n\nWe can obtain critical values for the $\\chi^2_q$ distribution using the `qchisq()` function in R. For example, if we wanted to obtain the critical value $w_{\\alpha}$ such that $\\P(W > w_{\\alpha}) = \\alpha$ for our two-parameter interaction example, we could use:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqchisq(p = 0.95, df = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5.991465\n```\n:::\n:::\n\n\n:::\n\n\nWe need to define the rejection region to use the Wald statistic in a test. Because we are squaring each distance, $W \\geq 0$, and larger values of $W$ indicate more disagreement with the null in either direction. Thus, for an $\\alpha$-level test of the joint null, we only need a one-sided rejection region of the form $\\P(W > w_{\\alpha}) = \\alpha$. Obtaining these values is straightforward (see the above callout tip). For $q = 2$ and $\\alpha = 0.05$, the critical value is roughly 6. 
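\n\nTo make this concrete, here is a minimal sketch of computing $W$ \"by hand\" for the interaction example, using simulated data and an HC covariance estimate from the `{sandwich}` package (both are illustrative assumptions, not part of the original example):\n\n::: {.cell}\n\n```{.r .cell-code}\n## simulated data for illustration only; the null beta_1 = beta_3 = 0 is true here\nset.seed(1234)\nn <- 100\nx <- rnorm(n)\nz <- rbinom(n, 1, 0.5)\ny <- 1 + 0.5 * z + rnorm(n)\n\nfit <- lm(y ~ x * z)\nb <- coef(fit)\nV_bhat <- sandwich::vcovHC(fit, type = \"HC0\")  # estimates V_beta / n, so the n in W is absorbed\n\n## H0: L beta = c with q = 2 restrictions (rows pick out beta_1 and beta_3)\nL <- rbind(c(0, 1, 0, 0),\n           c(0, 0, 0, 1))\nc0 <- c(0, 0)\n\nW <- drop(t(L %*% b - c0) %*% solve(L %*% V_bhat %*% t(L)) %*% (L %*% b - c0))\nc(W = W, critical_value = qchisq(0.95, df = 2))  # reject when W exceeds the critical value\n```\n:::\n\n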
\n\nThe Wald statistic is not a common test provided by standard statistical software functions like `lm()` in R, though it is fairly straightforward to implement \"by hand,\" as the sketch above illustrates. Alternatively, packages like [`{aod}`](https://cran.r-project.org/web/packages/aod/index.html) or [`{clubSandwich}`](http://jepusto.github.io/clubSandwich/) have implementations of the test. What is reported by most software implementations of OLS (like `lm()` in R) is the F-statistic, which is\n$$ \nF = \\frac{W}{q},\n$$\nwhich also typically uses the homoskedastic variance estimator $\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bfbeta}$ in $W$. The p-values reported for such tests use the $F_{q,n-k-1}$ distribution because this is the exact distribution of the $F$ statistic when the errors are (a) homoskedastic and (b) normally distributed. When these assumptions do not hold, the $F$ distribution has no exact statistical justification, but it is slightly more conservative than the $\\chi^2_q$ distribution, and inferences based on the two will converge as $n\\to\\infty$. So it might be justified as an *ad hoc* small-sample adjustment to the Wald test. For example, suppose we used the $F_{q,n-k-1}$ distribution with the interaction example, where $q=2$, and a sample size of $n = 100$. In that case, the critical value for the F test with $\\alpha = 0.05$ is\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqf(0.95, df1 = 2, df2 = 100 - 4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.091191\n```\n:::\n:::\n\n\n\nThis result implies a critical value of 6.182 on the scale of the Wald statistic (multiplying it by $q = 2$). Compared to the earlier critical value of 5.991 based on the $\\chi^2_2$ distribution, we can see that the inferences will be very similar even in moderately sized datasets. \n\nFinally, note that the F-statistic reported by `lm()` in R is for the null hypothesis that all the coefficients except the intercept are 0. In modern quantitative social science, this test is seldom substantively interesting. \n\n\n## Finite-sample properties with a linear CEF\n\nAll the above results have been large-sample properties, and we have not addressed finite-sample properties like the sampling variance or unbiasedness. Under the linear projection assumption above, OLS is generally biased without stronger assumptions. This section introduces the stronger assumption that will allow us to establish stronger properties for OLS. As usual, however, remember that these stronger assumptions can be wrong. \n\n\n::: {.callout-note}\n## Assumption: Linear Regression Model\n1. The variables $(Y_{i}, \\X_{i})$ satisfy the linear CEF assumption.\n$$ \n\\begin{aligned}\n Y_{i} &= \\X_{i}'\\bfbeta + e_{i} \\\\\n \\E[e_{i}\\mid \\X_{i}] & = 0.\n\\end{aligned}\n$$\n\n2. The design matrix is invertible $\\E[\\X_{i}\\X_{i}'] > 0$ (positive definite).\n:::\n\n\nWe discussed the concept of a linear CEF extensively in @sec-regression. However, recall that the CEF might be linear mechanically if the model is **saturated** or when there are as many coefficients in the model as there are unique values of $\\X_i$. When a model is not saturated, the linear CEF assumption is just that: an assumption. What can this assumption do? It can actually establish quite a few nice statistical properties in finite samples. \n\nOne note before we proceed. When focusing on finite-sample inference for OLS, it is customary to focus on its properties **conditional on the observed covariates**, such as $\\E[\\bhat \\mid \\Xmat]$ or $\\V[\\bhat \\mid \\Xmat]$. The historical reason for this was that the researcher often chose these independent variables, so they were not random. Thus, you'll sometimes see $\\Xmat$ treated as \"fixed\" in some older texts, and they might even omit explicit conditioning statements. \n\n\n::: {#thm-ols-unbiased}\n\nUnder the linear regression model assumption, OLS is unbiased for the population regression coefficients, \n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta,\n$$\nand its conditional sampling variance is\n$$\n\\mb{\\V}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nwhere $\\sigma^2_{i} = \\E[e_{i}^{2} \\mid \\Xmat]$. \n:::\n\n\n::: {.proof}\n\nTo prove the conditional unbiasedness, recall that we can write the OLS estimator as\n$$\n\\bhat = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e},\n$$\nand so taking (conditional) expectations, we have\n$$\n\\E[\\bhat \\mid \\Xmat] = \\bfbeta + \\E[(\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = \\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\E[\\mb{e} \\mid \\Xmat] = \\bfbeta,\n$$\nbecause under the linear CEF assumption $\\E[\\mb{e}\\mid \\Xmat] = 0$. 
\n\nFor the conditional sampling variance, we can use the same decomposition we have,\n$$\n\\V[\\bhat \\mid \\Xmat] = \\V[\\bfbeta + (\\Xmat'\\Xmat)^{-1}\\Xmat'\\mb{e} \\mid \\Xmat] = (\\Xmat'\\Xmat)^{-1}\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat(\\Xmat'\\Xmat)^{-1}. \n$$\nSince $\\E[\\mb{e}\\mid \\Xmat] = 0$, we know that $\\V[\\mb{e}\\mid \\Xmat] = \\E[\\mb{ee}' \\mid \\Xmat]$, which is a matrix with diagonal entries $\\E[e_{i}^{2} \\mid \\Xmat] = \\sigma^2_i$ and off-diagonal entries $\\E[e_{i}e_{j} \\Xmat] = \\E[e_{i}\\mid \\Xmat]\\E[e_{j}\\mid\\Xmat] = 0$, where the first equality follows from the independence of the errors across units. Thus, $\\V[\\mb{e} \\mid \\Xmat]$ is a diagonal matrix with $\\sigma^2_i$ along the diagonal, which means\n$$\n\\Xmat'\\V[\\mb{e} \\mid \\Xmat]\\Xmat = \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i',\n$$\nestablishing the conditional sampling variance.\n \n:::\n\nThus, for any realization of the covariates, $\\Xmat$, OLS is unbiased for the true regression coefficients $\\bfbeta$. By the law of iterated expectation, we also know that it is unconditionally unbiased[^unconditional] as well since\n$$\n\\E[\\bhat] = \\E[\\E[\\bhat \\mid \\Xmat]] = \\bfbeta. \n$$\nThe difference between these two statements usually isn't incredibly meaningful. \n\n[^unconditional]: We are basically ignoring some edge cases when it comes to discrete covariates here. In particular, we assume that $\\Xmat'\\Xmat$ is nonsingular with probability one. However, this can fail if we have a binary covariate since there is some chance (however slight) that the entire column will be all ones or all zeros, which would lead to a singular matrix $\\Xmat'\\Xmat$. Practically this is not a big deal, but it does mean that we have to ignore this issue theoretically or focus on conditional unbiasedness. \n\n\nThere are a lot of variances flying around, so it's helpful to review them. Above, we derived the asymptotic variance of $\\mb{Z}_{n} = \\sqrt{n}(\\bhat - \\bfbeta)$, \n$$\n\\mb{V}_{\\bfbeta} = \\left( \\E[\\X_i\\X_i'] \\right)^{-1}\\E[e_i^2\\X_i\\X_i']\\left( \\E[\\X_i\\X_i'] \\right)^{-1},\n$$\nwhich implies that the approximate variance of $\\bhat$ will be $\\mb{V}_{\\bfbeta} / n$ because\n$$\n\\bhat = \\frac{Z_n}{\\sqrt{n}} + \\bfbeta \\quad\\implies\\quad \\bhat \\overset{a}{\\sim} \\N(\\bfbeta, n^{-1}\\mb{V}_{\\bfbeta}),\n$$\nwhere $\\overset{a}{\\sim}$ means approximately asymptotically distributed as. Under the linear CEF, the conditional sampling variance of $\\bhat$ has a similar form and will be similar to the \n$$\n\\mb{V}_{\\bhat} = \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2_i \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\approx \\mb{V}_{\\bfbeta} / n\n$$\nIn practice, these two derivations lead to basically the same variance estimator. Recall the heteroskedastic-consistent variance estimator\n$$\n\\widehat{\\mb{V}}_{\\bfbeta} = \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1} \\left( \\frac{1}{n} \\sum_{i=1}^n\\widehat{e}_i^2\\X_i\\X_i' \\right) \\left( \\frac{1}{n} \\Xmat'\\Xmat \\right)^{-1},\n$$\nis a valid plug-in estimator for the asymptotic variance and\n$$\n\\widehat{\\mb{V}}_{\\bhat} = n^{-1}\\widehat{\\mb{V}}_{\\bfbeta}.\n$$\nThus, in practice, the asymptotic and finite-sample results under a linear CEF justify the same variance estimator. \n\n\n### Linear CEF model under homoskedasticity\n\nIf we are willing to make a homoskedasticity assumption on the errors, we can derive even stronger results for OLS. 
Stronger assumptions typically lead to stronger conclusions, but those conclusions may not be robust to assumption violations. Still, homoskedasticity is such a historically important assumption that statistical software implementations of OLS like `lm()` in R assume it. \n\n::: {.callout-note}\n\n## Assumption: Homoskedasticity with a linear CEF\n\nIn addition to the linear CEF assumption, we further assume that\n$$\n\\E[e_i^2 \\mid \\X_i] = \\E[e_i^2] = \\sigma^2,\n$$\nor that the variance of the errors does not depend on the covariates. \n:::\n\n\n::: {#thm-homoskedasticity}\n\nUnder a linear CEF model with homoskedastic errors, the conditional sampling variance is\n$$\n\\mb{V}^{\\texttt{lm}}_{\\bhat} = \\V[\\bhat \\mid \\Xmat] = \\sigma^2 \\left( \\Xmat'\\Xmat \\right)^{-1},\n$$\nand the variance estimator \n$$\n\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} = \\widehat{\\sigma}^2 \\left( \\Xmat'\\Xmat \\right)^{-1} \\quad\\text{where}\\quad \\widehat{\\sigma}^2 = \\frac{1}{n - k - 1} \\sum_{i=1}^n \\widehat{e}_i^2\n$$\nis unbiased, $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. \n::: \n\n::: {.proof}\nUnder homoskedasticity $\\sigma^2_i = \\sigma^2$ for all $i$. Recall that $\\sum_{i=1}^n \\X_i\\X_i' = \\Xmat'\\Xmat$. Thus, the conditional sampling variance from @thm-ols-unbiased is \n$$ \n\\begin{aligned}\n\\V[\\bhat \\mid \\Xmat] &= \\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\sigma^2 \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\ &= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\sum_{i=1}^n \\X_i\\X_i' \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1}\\left( \\Xmat'\\Xmat \\right) \\left( \\Xmat'\\Xmat \\right)^{-1} \\\\&= \\sigma^2\\left( \\Xmat'\\Xmat \\right)^{-1} = \\mb{V}^{\\texttt{lm}}_{\\bhat}.\n\\end{aligned}\n$$\n\nFor unbiasedness, we just need to show that $\\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] = \\sigma^2$. Recall that we defined $\\mb{M}_{\\Xmat}$ as the residual-maker because $\\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}}$. We can use this to connect the residuals to the errors,\n$$ \n\\mb{M}_{\\Xmat}\\mb{e} = \\mb{M}_{\\Xmat}\\mb{Y} - \\mb{M}_{\\Xmat}\\Xmat\\bfbeta = \\mb{M}_{\\Xmat}\\mb{Y} = \\widehat{\\mb{e}},\n$$ \nso \n$$\n\\V[\\widehat{\\mb{e}} \\mid \\Xmat] = \\mb{M}_{\\Xmat}\\V[\\mb{e} \\mid \\Xmat]\\mb{M}_{\\Xmat}' = \\sigma^{2}\\mb{M}_{\\Xmat}\\mb{M}_{\\Xmat}' = \\sigma^{2}\\mb{M}_{\\Xmat},\n$$\nwhere the first equality holds because $\\mb{M}_{\\Xmat} = \\mb{I}_{n} - \\Xmat (\\Xmat'\\Xmat)^{-1} \\Xmat'$ is constant conditional on $\\Xmat$, the second because $\\V[\\mb{e} \\mid \\Xmat] = \\sigma^{2}\\mb{I}_{n}$ under homoskedasticity, and the last because $\\mb{M}_{\\Xmat}$ is symmetric and idempotent. Notice that the diagonal entries of this covariance matrix are the variances of the individual residuals $\\widehat{e}_i$ and that the diagonal entries of the annihilator matrix $\\mb{M}_{\\Xmat}$ are $1 - h_{ii}$ (since the $h_{ii}$ are the diagonal entries of $\\mb{P}_{\\Xmat}$). Thus, we have\n$$ \n\\V[\\widehat{e}_i \\mid \\Xmat] = \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] = (1 - h_{ii})\\sigma^{2}.\n$$\nIn the last chapter, we established one property of these leverage values in @sec-leverage, namely $\\sum_{i=1}^n h_{ii} = k + 1$, so $\\sum_{i=1}^n (1 - h_{ii}) = n - k - 1$ and we have\n$$ \n\\begin{aligned}\n \\E[\\widehat{\\sigma}^{2} \\mid \\Xmat] &= \\frac{1}{n-k-1} \\sum_{i=1}^{n} \\E[\\widehat{e}_{i}^{2} \\mid \\Xmat] \\\\\n &= \\frac{\\sigma^{2}}{n-k-1} \\sum_{i=1}^{n} (1 - h_{ii}) \\\\\n &= \\sigma^{2}. \n\\end{aligned}\n$$\nThis establishes $\\E[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat} \\mid \\Xmat] = \\mb{V}^{\\texttt{lm}}_{\\bhat}$. 
\n\n:::\n\n\nThus, under the linear CEF model and homoskedasticity of the errors, we have an unbiased variance estimator that is a simple function of the sum of squared residuals and the design matrix. Most statistical software packages estimate standard errors using $\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}$. \n\n\nThe final result we can derive for the linear CEF under homoskedasticity is an optimality result. We might ask ourselves if there is another estimator for $\\bfbeta$ that would outperform OLS in the sense of having a lower sampling variance. Perhaps surprisingly, no linear estimator for $\\bfbeta$ has a lower conditional variance, meaning that OLS is the **best linear unbiased estimator**, often jovially shortened to BLUE. This result is famously known as the Gauss-Markov Theorem.\n\n::: {#thm-gauss-markov}\n\nLet $\\widetilde{\\bfbeta} = \\mb{AY}$ be a linear and unbiased estimator for $\\bfbeta$. Under the linear CEF model with homoskedastic errors, \n$$\n\\V[\\widetilde{\\bfbeta}\\mid \\Xmat] \\geq \\V[\\bhat \\mid \\Xmat]. \n$$\n\n:::\n\n::: {.proof}\nNote that if $\\widetilde{\\bfbeta}$ is unbiased then $\\E[\\widetilde{\\bfbeta} \\mid \\Xmat] = \\bfbeta$ and so \n$$\n\\bfbeta = \\E[\\mb{AY} \\mid \\Xmat] = \\mb{A}\\E[\\mb{Y} \\mid \\Xmat] = \\mb{A}\\Xmat\\bfbeta,\n$$\nwhich can hold for every $\\bfbeta$ only if $\\mb{A}\\Xmat = \\mb{I}_{k+1}$. \nRewrite the competitor as $\\widetilde{\\bfbeta} = \\bhat + \\mb{BY}$, where\n$$ \n\\mb{B} = \\mb{A} - \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat',\n$$\nand note that $\\mb{A}\\Xmat = \\mb{I}_{k+1}$ implies that $\\mb{B}\\Xmat = 0$. We now have\n$$ \n\\begin{aligned}\n \\widetilde{\\bfbeta} &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{Y} \\\\\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\mb{B}\\Xmat\\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e} \\\\\n &= \\bfbeta + \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\mb{e}\n\\end{aligned}\n$$\nThe variance of the competitor is, thus, \n$$ \n\\begin{aligned}\n \\V[\\widetilde{\\bfbeta} \\mid \\Xmat]\n &= \\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\V[\\mb{e}\\mid \\Xmat]\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)' \\\\\n &= \\sigma^{2}\\left( \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat' + \\mb{B}\\right)\\left( \\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{B}'\\right) \\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\left(\\Xmat'\\Xmat\\right)^{-1}\\Xmat'\\mb{B}' + \\mb{B}\\Xmat\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &= \\sigma^{2}\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)\\\\\n &\\geq \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1} \\\\\n &= \\V[\\bhat \\mid \\Xmat]\n\\end{aligned}\n$$\nThe first equality comes from the properties of covariance matrices; the second is due to homoskedasticity; the fourth is due to $\\mb{B}\\Xmat = 0$, which implies that $\\Xmat'\\mb{B}' = 0$ as well. The inequality in the fifth line holds because matrices of the form $\\mb{BB}'$ are always positive semidefinite, so dropping $\\sigma^{2}\\mb{BB}'$ can only make the variance smaller in the matrix sense. 
\n\n:::\n\nIn this proof, we saw that the competing estimator has variance $\\sigma^2\\left(\\left(\\Xmat'\\Xmat\\right)^{-1} + \\mb{BB}'\\right)$, which exceeds the variance of OLS by $\\sigma^{2}\\mb{BB}'$, a matrix that is \"greater than or equal to 0\" in the matrix sense, also called positive semidefinite. What does this mean practically? Remember that a positive semidefinite matrix must have nonnegative diagonal entries and that the diagonal entries of $\\V[\\bhat \\mid \\Xmat]$ and $\\V[\\widetilde{\\bfbeta}\\mid \\Xmat]$ are the variances of the individual parameters, $\\V[\\widehat{\\beta}_{j} \\mid \\Xmat]$ and $\\V[\\widetilde{\\beta}_{j} \\mid \\Xmat]$. Thus, the variance of each individual parameter will be at least as large for $\\widetilde{\\bfbeta}$ as for $\\bhat$.\n\nMany textbooks cite the Gauss-Markov theorem as a critical advantage of OLS over other methods, but it's essential to recognize its limitations. It requires linearity and homoskedastic error assumptions, which can be false in many applications. \n\nFinally, note that while we have shown this result for linear estimators, @Hansen22 proves a more general version of this result that applies to any unbiased estimator. \n\n## The normal linear model\n\nFinally, we add the strongest and thus least loved of the classical linear regression assumptions: (conditional) normality of the errors. The historical reason to use this assumption was that finite-sample inference hits a roadblock without some knowledge of the sampling distribution of $\\bhat$. Under the linear CEF model, we saw that OLS was unbiased, and under homoskedasticity, we could produce an unbiased estimator of the conditional variance. But to do hypothesis testing or generate confidence intervals, we need to be able to make probability statements about the estimator, and for that, we need to know its exact distribution. When the sample size is large, we can rely on the CLT and know it is approximately normal. But in small samples, what do we do? Historically, we decided to assume (conditional) normality of the errors to proceed with some knowledge that we were wrong but hopefully not too wrong. \n\n\n::: {.callout-note}\n\n## The normal linear regression model\n\nIn addition to the linear CEF assumption, we assume that \n$$\ne_i \\mid \\Xmat \\sim \\N(0, \\sigma^2).\n$$\n\n:::\n\nA couple of things to point out: \n\n- The assumption here is not that $(Y_{i}, \\X_{i})$ are jointly normal (though this would be sufficient for the assumption to hold), but rather that $Y_i$ is normally distributed conditional on $\\X_i$. \n- Notice that the normal regression model has the homoskedasticity assumption baked in. \n\n::: {#thm-normal-ols}\n\nUnder the normal linear regression model, we have\n$$ \n\\begin{aligned}\n \\bhat \\mid \\Xmat &\\sim \\N\\left(\\bfbeta, \\sigma^{2}\\left(\\Xmat'\\Xmat\\right)^{-1}\\right) \\\\\n \\frac{\\widehat{\\beta}_{j} - \\beta_{j}}{\\sqrt{[\\widehat{\\mb{V}}^{\\texttt{lm}}_{\\bhat}]_{jj}}} &\\sim t_{n-k-1} \\\\\n W/q &\\sim F_{q, n-k-1}. \n\\end{aligned}\n$$\n\n:::\n\n\nThis theorem says that in the normal linear regression model, the coefficients follow a normal distribution, the t-statistics follow a $t$-distribution, and a transformation of the Wald statistic follows an $F$ distribution. These are **exact** results and do not rely on large-sample approximations. Under the assumption of conditional normality of the errors, they are as valid for $n = 5$ as for $n = 500,000$. \n\nFew people believe errors follow a normal distribution, so why even present these results? 
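\n\nBefore answering, it is worth seeing how small the stakes usually are: the $t_{n-k-1}$ critical values that software reports differ only slightly from the normal ones once the sample is even moderately large. A quick check in R, with degrees of freedom chosen to match the interaction example ($n = 100$, four coefficients) purely as an illustrative assumption:\n\n::: {.cell}\n\n```{.r .cell-code}\n## two-sided critical values with alpha = 0.05\nqnorm(0.975)             # large-sample normal approximation\nqt(0.975, df = 100 - 4)  # t-based value used by lm() when n = 100 and k + 1 = 4\nqt(0.975, df = 10)       # the difference only matters in very small samples\n```\n:::\n\n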
Unfortunately, most statistical software implementations of OLS implicitly assume this when calculating p-values for tests or constructing confidence intervals. That is, the p-value associated with the $t$-statistic that `lm()` outputs in R relies on the $t_{n-k-1}$ distribution, and the critical values used to construct confidence intervals with `confint()` use that distribution as well. When normality does not hold, there is no principled reason to use the $t$ or the $F$ distributions in this way. But we might hold our nose and use this *ad hoc* procedure under two rationalizations:\n\n- $\\bhat$ is asymptotically normal, but this approximation might be poor in smaller finite samples. The $t$ distribution will make inference more conservative in these cases (wider confidence intervals, smaller test rejection regions), which might help offset the poor approximation of the normal in small samples. \n- As $n\\to\\infty$, the $t_{n-k-1}$ will converge to a standard normal distribution, so the *ad hoc* adjustment will not matter much for medium to large samples. \n\nThese arguments are not very convincing since it's unclear whether the $t$ approximation will be any better than the normal in finite samples. But it's the best we can do to console ourselves as we find more data. \n", "supporting": [ "08_ols_properties_files/figure-pdf" ], diff --git a/_freeze/08_ols_properties/figure-pdf/fig-wald-1.pdf b/_freeze/08_ols_properties/figure-pdf/fig-wald-1.pdf index e150ee9ffd40672df898ce6dafee21b73d30b2bd..21f33c654ac5b40bcd2c106fc77712e15fbed7c1 100644 GIT binary patch delta 183 zcmX@Sn)&c*<_$}asu~&_7@C<_n3`#F>HFrVxFnXODrmS^85tNE8^Yx_pE}yj%LC`? z8kkN_yey6)u({>(1IBnGS2tG|7fVA!S2GtEQ*$R5Q)eSfQ&&R^7gtADR|5+>1sj4& fVioM{xQa^>i%KerQq#ChO^htexKveL{oS|#%(pS8 delta 183 zcmX@Sn)&c*<_$}asu~!W8JZfHn3!vF>HFrVxFnXODrmS^85tNE8^Yx_pE}yj%LC`? z8kkH@yey6)u({>(1IBn0Q%4t57ZVpFXA37|b5m0@6IV+oV>c56Gb0N#OG8sT1sj4& eVioM{xQa^>i%KerQq#ChO@LatR8?L5-M9dQF))n)