From 982cc9231b6ab0642830a1155e8f0442b5a81a13 Mon Sep 17 00:00:00 2001 From: Matthew Blackwell Date: Thu, 26 Oct 2023 20:25:27 -0400 Subject: [PATCH] ch 4 typos (fixes #23; fixes #24; fixes #25; fixes #26) --- 04_hypothesis_tests.qmd | 8 ++++---- .../02_estimation/execute-results/tex.json | 4 ++-- _freeze/02_estimation/figure-pdf/mse-1.pdf | Bin 13637 -> 13637 bytes .../03_asymptotics/execute-results/tex.json | 4 ++-- .../figure-pdf/fig-ci-sim-1.pdf | Bin 6292 -> 6292 bytes .../03_asymptotics/figure-pdf/fig-clt-1.pdf | Bin 11826 -> 11826 bytes .../03_asymptotics/figure-pdf/fig-delta-1.pdf | Bin 6076 -> 6076 bytes .../figure-pdf/fig-lln-sim-1.pdf | Bin 15862 -> 16305 bytes .../figure-pdf/fig-std-normal-1.pdf | Bin 7953 -> 7953 bytes .../execute-results/html.json | 4 ++-- .../execute-results/tex.json | 4 ++-- .../figure-pdf/fig-shape-of-t-1.pdf | Bin 6239 -> 6239 bytes .../figure-pdf/fig-size-power-1.pdf | Bin 9260 -> 9260 bytes .../figure-pdf/fig-two-sided-1.pdf | Bin 8104 -> 8104 bytes 14 files changed, 12 insertions(+), 12 deletions(-) diff --git a/04_hypothesis_tests.qmd b/04_hypothesis_tests.qmd index cead34d..5a989a1 100644 --- a/04_hypothesis_tests.qmd +++ b/04_hypothesis_tests.qmd @@ -171,7 +171,7 @@ A test has **significance level** $\alpha$ if its size is less than or equal to ::: -A test with a significance level of $\alpha = 0.05$ will have a false positive/type I error rate no larger than 0.05. This level is widespread in the social sciences, though you also will see $\alpha = 0.01$ or $\alpha = 0.1$. Frequentists justify this by saying this means that with $\alpha = 0.05$, there will only be 5% of studies that will produce false discoveries. +A test with a significance level of $\alpha = 0.05$ will have a false positive/type I error rate no larger than 0.05. This level is widespread in the social sciences, though you also will see $\alpha = 0.01$ or $\alpha = 0.1$. 
Frequentists justify this by saying that with $\alpha = 0.05$, at most 5% of studies will produce false discoveries. Our task is to construct the rejection region so that the **null distribution** of the test statistic $G_0(t) = \P(T \leq t \mid \theta_0)$ has less than $\alpha$ probability in that region. One-sided tests like in @fig-size-power are the easiest to show, even though we warned you not to use them. We want to choose $c$ that puts no more than $\alpha$ probability in the tail, or $$ @@ -323,7 +323,7 @@ We can use this estimator to derive the Wald test statistic of $$ T = \frac{\widehat{\tau} - 0}{\widehat{\se}[\widehat{\tau}]} = \frac{\Ybar - \Xbar}{\sqrt{\frac{s^2_t}{n_t} + \frac{s^2_c}{n_c}}}, $$ -and if we want an asymptotically level of 0.05, we can reject when $|T| > 1.96$. +and if we want an asymptotic level of 0.05, we can reject when $|T| > 1.96$. ::: @@ -335,7 +335,7 @@ One alternative to reporting the reject/retain decision is to report a **p-value ::: {#def-p-value} -The **p-value** of a test is the probability of observing a test statistic is at least as extreme as the observed test statistic in the direction of the alternative hypothesis. +The **p-value** of a test is the probability of observing a test statistic at least as extreme as the observed test statistic in the direction of the alternative hypothesis. ::: @@ -356,7 +356,7 @@ There is a lot of controversy surrounding p-values but most of it focuses on arb People use many statistical shibboleths to purportedly identify people who don't understand statistics and usually hinge on seemingly subtle differences in interpretation that are easy to miss. If you know the core concepts, the statistical shibboleths tend to be overblown, but it would be malpractice not to flag them for you. -The shibboleth with p-values is that sometimes people interpret them as "the probability that the null hypothesis is true." 
Of course, this doesn't make sense from our definition because the p-values *conditions* on the null hypothesis---it cannot tell us anything about the probability of that null hypothesis. Instead, the metaphor you should always carry is that hypothesis tests are statistical thought experiments and that p-values answer the question: how likely would my data be if the null were true? +The shibboleth with p-values is that sometimes people interpret them as "the probability that the null hypothesis is true." Of course, this doesn't make sense from our definition because the p-value *conditions* on the null hypothesis---it cannot tell us anything about the probability of that null hypothesis. Instead, the metaphor you should always carry is that hypothesis tests are statistical thought experiments and that p-values answer the question: how likely would my data be if the null were true? ::: diff --git a/_freeze/02_estimation/execute-results/tex.json b/_freeze/02_estimation/execute-results/tex.json index 5746dba..3b76ab7 100644 --- a/_freeze/02_estimation/execute-results/tex.json +++ b/_freeze/02_estimation/execute-results/tex.json @@ -1,7 +1,7 @@ { - "hash": "af9fef464895429f29f9e4ee822bdba5", + "hash": "4d964706c4457aeed3e5322dfe646944", "result": { - "markdown": "# Estimation\n \n\n## Introduction\n\nWhen studying probability, we assumed that we knew the parameter of a distribution (the mean or the variance) and used probability theory to understand what kind of data we would observe. Estimation and inference put this engine in reverse and try to learn some aspect of the data-generating process using only our observed data. There are two main goals here: **estimation**, which is how we formulate our best guess about a parameter of the DGP, and **inference**, which is how we formalize and express uncertainty about our estimates. 
\n\n![](assets/img/two-direction.png)\n\n\n\n::: {#exm-rct}\n\n\n\n## Randomized control trial\n\nSuppose we are conducting a randomized experiment on framing effects. All respondents receive some factual information about current levels of immigration. The message for the treatment group ($D_i = 1$) has an additional framing of the positive benefits of immigration, while the control group ($D_i = 0$) receives no additional framing. The outcome is a binary outcome on whether the respondent supports increasing legal immigration limits ($Y_i = 1$)\nor not ($Y_i = 0$). The observed data consists of $n$ pairs of random variables, the outcome, and the treatment assignment: $\\{(Y_1, D_1), \\ldots, (Y_n, D_n)\\}$. Define the two sample means/proportions in each group as \n$$\n\\Ybar_1 = \\frac{1}{n_1} \\sum_{i: D_i = 1} Y_i, \\qquad\\qquad \\Ybar_0 = \\frac{1}{n_0} \\sum_{i: D_i = 0} Y_i,\n$$\nwhere $n_1 = \\sum_{i=1}^n D_i$ is the number of treated units and $n_0 = n - n_1$ is the number of control units. \n\nA standard estimator for the treatment effect in a study like this would be the difference in means, $\\Ybar_1 - \\Ybar_0$. But this is only one possible estimator. We could also estimate the effect by taking this difference in means separately by party identification and then averaging those party-specific effects by the size of those groups. This estimator is commonly called a **poststratification** estimator, but it's unclear at first glance which of these two estimators we should prefer. \n\n:::\n\n\nWhat are the goals of studying estimators? In short, we prefer to use **good** estimators rather than **bad** estimators. But what makes an estimator good or bad? You probably have some intuitive sense that, for example, an estimator that always returns the value 3 is bad. Still, it will be helpful for us to formally define and explore some properties of estimators that will allow us to compare them and choose the good over the bad. 
\n\n\n## Samples and populations\n\n\n\n\nFor most of this class, we'll focus on a relatively simple setting where we have a set of random vectors $\\{X_1, \\ldots, X_n\\}$ that are **independent and identically distributed** (iid) draws from a distribution with cumulative distribution function (cdf) $F$. They are independent in that information about any subset of random vectors is not informative about any other subset of random vectors, or, more formally, \n$$\nF_{X_{1},\\ldots,X_{n}}(x_{1}, \\ldots, x_{n}) = F_{X_{1}}(x_{1})\\cdots F_{X_{n}}(x_{n}),\n$$\nwhere $F_{X_{1},\\ldots,X_{n}}(x_{1}, \\ldots, x_{n})$ is the joint cdf of the random vectors and $F_{X_{j}}(x_{j})$ is the marginal cdf of the $j$th random vector. They are \"identically distributed\" in the sense that each of the random variables $X_i$ has the same marginal distribution, $F$.\n\n[^model]: This approach to inference is often called a **model-based approach** since we are assuming a probability model in the cdf, $F$. This is usually in contrast to a **design-based approach** to inference that views the population of interest as a finite group with fixed traits and the only randomness comes from the random sampling procedure. \n\n\n\nYou can think of each vector, $X_i$, as the rows in your data frame. Note that we're being purposely vague about this cdf---it simply represents the unknown distribution of the data, otherwise known as the **data generating process** (DGP). Sometimes $F$ is also referred to as the **population distribution** or even just **population**, which has its roots in viewing the data as a random sample from some larger population.[^model] As a shorthand, we often say that the collection of random vectors $\\{X_1, \\ldots, X_n\\}$ is a **random sample** from population $F$ if $\\{X_1, \\ldots, X_n\\}$ is iid with distribution $F$. The **sample size** $n$ is the number of units in the sample. 
\n\n\nTwo metaphors can help build intuition about the concept of viewing the data as an iid draw from $F$:\n\n1. **Random sampling**. Suppose we have a population of size $N$ that is much larger than our sample size $n$, and we take a random sample of size $n$ from this population with replacement. Then the distribution of the data in the random sample will be iid draws from the population distribution of the variables we are sampling. For instance, suppose we take a random sample from a population of US citizens where the population proportion of Democratic party identifiers is 0.33. Then if we randomly sample $n = 100$ US citizens, each data point $X_i$ will be distributed Bernoulli with probability of success 0.33. \n2. **Groundhog Day**. Random sampling does not always make sense as a justification for iid data, especially when the units are not samples at all but rather countries, states, or subnational units. In this case, we have to appeal to a thought experiment where $F$ represents the fundamental uncertainty in the data-generating process. The metaphor here is that if we could re-run history many times, like the 1993 American classic comedy *Groundhog Day*, data and outcomes would change slightly due to the inherently stochastic nature of the world. The iid assumption, then, is that each of the units in our data has the same DGP producing this data or the same distribution of outcomes under the *Groundhog Day* scenario. The set of all these infinite possible draws from the DGP is sometimes referred to as the **superpopulation**. \n\nNote that there are many situations where the iid assumption is not appropriate. We will cover some of those later in the semester. But much of the innovation and growth in statistics over the last 50 years has been figuring out how to perform statistical inference when iid does not hold. Often, the solutions are specific to the type of iid violation you have (spatial, time-series, network, or clustered). 
As a rule of thumb, though, if you suspect iid is incorrect, your uncertainty statements will likely be overconfident (for example, confidence intervals, which we'll cover later, are too small). \n\n## Point estimation\n\n### Quantities of interest\n\nWe aim to learn about the data-generating process, represented by the cdf, $F$. We might be interested in estimating the cdf at a general level or only some feature of the distribution, like a mean or conditional expectation function. We will almost always have a particular quantity in mind, but we'll introduce estimation at a general level. So we'll let $\\theta$ represent the quantity of interest. **Point estimation** describes how we obtain a single \"best guess\" about $\\theta$.\n\n::: {.callout-note}\n\nSome refer to quantities of interest as **parameters** or **estimands** (that is, the target of estimation).\n\n:::\n\n\n::: {#exm-prop}\n\n## Population mean\n\nSuppose we wanted to know the proportion of US citizens who support increasing legal immigration in the US, which we denote as $Y_i = 1$. Then our quantity of interest is the mean of this random variable, $\\mu = \\E[Y_i]$, which is the probability of randomly drawing someone from the population supporting increased legal immigration. \n\n:::\n\n::: {#exm-var}\n\n## Population variance\n\nFeeling thermometer scores are a prevalent way to assess how a survey respondent feels about a particular person or group. A survey asks respondents how warmly they feel about a group from 0 to 100, which we will denote $Y_i$. We might be interested in how polarized views are on a group in the population, and one measure of polarization could be the variance, or spread, of the distribution of $Y_i$ around the mean. In this case, $\\sigma^2 = \\V[Y_i]$ would be our quantity of interest. \n\n:::\n\n\n::: {#exm-rct-ii}\n\n## RCT continued\n\nIn @exm-rct, we discussed a typical estimator for an experimental study with a binary treatment. 
The goal of that experiment is to learn about the difference between two conditional probabilities (or expectations): the average support for increasing legal immigration in the treatment group, $\\mu_1 = \\E[Y_i \\mid D_i = 1]$, and the same average in the control group, $\\mu_0 = \\E[Y_i \\mid D_i = 0]$. This difference, $\\mu_1 - \\mu_0$, is a function of unknown features of these two conditional distributions. \n\n:::\n\nEach of these is a function of the (possibly joint) distribution of the data, $F$. In each of these, we are not necessarily interested in the entire distribution, just summaries of it (central tendency, spread). Of course, there are situations where we are also interested in the complete distribution. \n\n### Estimators \n\nWhen our sample size is more than a few observations, it makes no sense to work with the raw data, $X_1, \\ldots, X_n$, and we inevitably will need to *summarize* the data in some way. We can represent this summary as a function, $g(x_1, \\ldots, x_n)$, which might be the formula for the sample mean or sample variance. This function is just a regular function that takes in $n$ numbers (or vectors) and returns a number (or vector). We can also define a random variable based on this function, $Y = g(X_1, \\ldots, X_n)$, which inherits its randomness from the randomness of the data. Before we see the data, we don't know what values of $X_1, \\ldots, X_n$ we will see, so we don't know what value of $Y$ we'll see either. We call the random variable $Y = g(X_1, \\ldots, X_n)$ a **statistic** (or sometimes sample statistics), and we refer to the probability distribution of a statistic $Y$ as the **sampling distribution** of $Y$. \n\n\n::: {.callout-warning}\n\nThere is one potential confusion in how we talk about \"statistics.\" Just above, we defined a statistic as a random variable based on it being a function of random variables (the data). 
But we sometimes refer to the calculated value as a statistic as well, which is a specific number that you see in your R output. To be precise, we should call the latter the **realized value** of the statistic, but message discipline is difficult to enforce in this context. A simple example might help. Suppose that $X_1$ and $X_2$ are the results of a roll of two standard six-sided dice. Then the statistic $Y = X_1 + X_2$ is a random variable that has a distribution over the numbers from $\\{2, \\ldots, 12\\}$ that describes our uncertainty over what the sum will be *before we roll the dice*. Once we have rolled the dice and observed the realized values $X_1 = 3$ and $X_2 = 4$, we observe the realized value of the statistic, $Y = 7$. \n\n:::\n\nAt their most basic, statistics are just data summaries without aim or ambition. Estimators are statistics with a purpose: to provide an \"educated guess\" about some quantity of interest. \n\n::: {#def-estimator}\nAn **estimator** $\\widehat{\\theta}_n = \\theta(X_1, \\ldots, X_n)$ for some parameter $\\theta$ is a statistic intended as a guess about $\\theta$.\n:::\n\nOne important distinction of jargon is between an estimator and an estimate, similar to the issues with \"statistic\" described above. The estimator is a function of the data, whereas the **estimate** is the *realized value* of the estimator once we see the data. An estimate is a single number, such as 0.38, whereas the estimator is a random variable that has uncertainty over what value it will take. Formally, the estimate is $\\theta(x_1, \\ldots, x_n)$ when the data is $\\{X_1, \\ldots, X_n\\} = \\{x_1, \\ldots, x_n\\}$, whereas we represent the estimator as a function of random variables, $\\widehat{\\theta}_n = \\theta(X_1, \\ldots, X_n)$. \n\n::: {.callout-note}\n\nIt is widespread, though not universal, to use the \"hat\" notation to define an estimator and its estimand. 
For example, $\\widehat{\\theta}$ (or \"theta hat\") indicates that this estimator is targeting the parameter $\\theta$. \n\n:::\n\n::: {#exm-mean-est}\n\n## Estimators for the population mean\n\nSuppose we would like to estimate the population mean of $F$, which we will represent as $\\mu = \\E[X_i]$. We could choose from several estimators, all with different properties. \n$$\n\\widehat{\\theta}_{n,1} = \\frac{1}{n} \\sum_{i=1}^n X_i, \\quad \\widehat{\\theta}_{n,2} = X_1, \\quad \\widehat{\\theta}_{n,3} = \\text{max}(X_1,\\ldots,X_n), \\quad \\widehat{\\theta}_{n,4} = 3\n$$\nThe first is just the sample mean, which is an intuitive and natural estimator for the population mean. The second just uses the first observation. While this seems silly, this is a valid statistic (it's a function of the data!). The third takes the maximum value in the sample, and the fourth always returns three, regardless of the data. \n:::\n\n## How to find estimators\n\nWhere do estimators come from? There are a couple of different methods that I'll cover briefly here before describing the ones that will form the bulk of this class. \n\n### Parametric models and maximum likelihood \n\nThe first method for generating estimators relies on **parametric models**, where the researcher specifies the exact distribution (up to some unknown parameters) of the DGP. Let $\\theta$ be the parameters of this distribution and we then write $\\{X_1, \\ldots, X_n\\}$ are iid draws from $F_{\\theta}$. We should also formally state the set of possible values the parameters can take, which we call the **parameter space** and usually denote as $\\Theta$. Because we're assuming we know the distribution of the data, we can write the p.d.f. 
as $f(X_i \\mid \\theta)$ and define the likelihood function as the product of these p.d.f.s over the units as a function of the parameters:\n$$\nL(\\theta) = \\prod_{i=1}^n f(X_i \\mid \\theta).\n$$\nWe can then define the **maximum likelihood** estimator (MLE) for $\\theta$ as the values of the parameter that, well, maximize the likelihood:\n$$\n\\widehat{\\theta}_{mle} = \\argmax_{\\theta \\in \\Theta} \\; L(\\theta)\n$$\nSometimes we can use calculus to derive a closed-form expression for the MLE. Still, we often use iterative techniques that search the parameter space for the maximum. \n\nMaximum likelihood estimators have very nice properties, especially in large samples. Unfortunately, they also require correct knowledge of the parametric model, which is often difficult to justify. Do we really know if we should model a given event count variable as Poisson or Negative Binomial? The attractive properties of MLE are only as good as our ability to specify the parametric model. \n\n::: {.callout-note}\n\n## No free lunch\n\nOne essential intuition to build about statistics is the **assumptions-precision tradeoff**. You can usually get more precise estimates if you make stronger and potentially more fragile assumptions. Conversely, you will almost always get less precise estimates if you weaken your assumptions.\n\n:::\n\n### Plug-in estimators\n\nThe second broad class of estimators is **semiparametric** in that we will specify some finite-dimensional parameters of the DGP but leave the rest of the distribution unspecified. For example, we might define a population mean, $\\mu = \\E[X_i]$, and a population variance, $\\sigma^2 = \\V[X_i]$, but leave unrestricted the shape of the distribution. This approach ensures that our estimators will be less dependent on correctly specifying distributions we have little intuition about. 
\n\nThe primary method for constructing estimators in this setting is to use the **plug-in estimator**, or the estimator that replaces any population mean with a sample mean. Obviously, in the case of estimating the population mean, $\\mu$, this means we will use the **sample mean** as its estimate:\n$$\n\\Xbar_n = \\frac{1}{n} \\sum_{i=1}^n X_i \\quad \\text{estimates} \\quad \\E[X_i] = \\int_{\\mathcal{X}} x f(x)dx\n$$\nWhat are we doing here? We are replacing the unknown population distribution $f(x)$ in the population mean with a discrete uniform distribution over our data points, with $1/n$ probability assigned to each unit. Why do this? It encodes that if we have a random sample, our best guess about the population distribution of $X_i$ is the sample distribution in our actual data. If this intuition fails, you can hold onto an analog principle: sample means of random variables are natural estimators of population means. \n\nWhat about estimating something more complicated, like the expected value of a function of the data, $\\theta = \\E[r(X_i)]$? The key is to see that $r(X_i)$ is also a random variable. Let's call this random variable $Y_i = r(X_i)$. Now we can see that $\\theta$ is just the population expectation of this random variable, and using the plug-in estimator, we get:\n$$\n\\widehat{\\theta} = \\frac{1}{n} \\sum_{i=1}^n Y_i = \\frac{1}{n} \\sum_{i=1}^n r(X_i). \n$$\n\nWith these facts in hand, we can describe the more general plug-in estimator. When we want to estimate some quantity of interest that is a function of population means, we can generate a plug-in estimator by replacing any population mean with a sample mean. Formally, let $\\alpha = g\\left(\\E[r(X_i)]\\right)$ be a parameter that is defined as a function of the population mean of a (possibly vector-valued) function of the data. 
Then, we can estimate this parameter by plugging in the sample mean for the population mean to get the **plug-in estimator**,\n$$\n\\widehat{\\alpha} = g\\left( \\frac{1}{n} \\sum_{i=1}^n r(X_i) \\right) \\quad \\text{estimates} \\quad \\alpha = g\\left(\\E[r(X_i)]\\right)\n$$\nThis approach to plug-in estimation with sample means is very general and will allow us to derive estimators in various settings. \n\n::: {#exm-var-est}\n\n## Estimating population variance\n\nThe population variance of a random variable is $\\sigma^2 = \\E[(X_i - \\E[X_i])^2]$. To derive a plug-in estimator for this quantity, we replace the inner $\\E[X_i]$ with $\\Xbar_n$ and the outer expectation with another sample mean:\n$$\n\\widehat{\\sigma}^2 = \\frac{1}{n} \\sum_{i=1}^n (X_i - \\Xbar_n)^2.\n$$\nThis plug-in estimator differs from the standard sample variance, which divides by $n - 1$ rather than $n$. This minor difference does not matter in moderate to large samples.\n:::\n\n::: {#exm-cov-est}\n\n## Estimating population covariance\n\nSuppose we have two variables, $(X_i, Y_i)$. A natural quantity of interest here is the population covariance between these variables, \n$$\n\\sigma_{xy} = \\text{Cov}[X_i,Y_i] = \\E[(X_i - \\E[X_i])(Y_i-\\E[Y_i])],\n$$\nwhich has the plug-in estimator,\n$$\n\\widehat{\\sigma}_{xy} = \\frac{1}{n} \\sum_{i=1}^n (X_i - \\Xbar_n)(Y_i - \\Ybar_n).\n$$\n:::\n\n::: {.callout-note}\n\n## Notation alert\n\nGiven the connection between the population mean and the sample mean, you will sometimes see the $\\E_n[\\cdot]$ operator used as a shorthand for the sample average:\n$$\n\\E_n[r(X_i)] \\equiv \\frac{1}{n} \\sum_{i=1}^n r(X_i).\n$$\n\n:::\n\nFinally, plug-in estimation goes beyond just replacing population means with sample means. We can derive estimators of the population quantiles like the median with sample versions of those quantities. 
What unifies all of these approaches is replacing the unknown population cdf, $F$, with the empirical cdf, \n$$\n\\widehat{F}_n(x) = \\frac{\\sum_{i=1}^n \\mathbb{I}(X_i \\leq x)}{n}.\n$$\nFor a more complete and technical treatment of these ideas, see Wasserman (2004) Chapter 7.\n\n\n\n## The three distributions: population, empirical, and sampling\n\nOnce we start to wade into estimation, there are several distributions to keep track of, and things can quickly become confusing. Three specific distributions are all related and easy to confuse, but keeping them distinct is crucial. \n\nThe **population distribution** is the distribution of the random variable, $X_i$, which we have labeled $F$ and is our target of inference. Then there is the **empirical distribution**, which is the distribution of the actual realizations of the random variables in our samples (that is, the numbers in our data frame), $X_1, \\ldots, X_n$. Because this is a random sample from the population distribution and can serve as an estimator of $F$, we sometimes call this $\\widehat{F}_n$. \n\n\n\n \n\n\nSeparately from both is the **sampling distribution of an estimator**, which is the probability distribution of $\\widehat{\\theta}_n$. It represents our uncertainty about our estimate before we see the data. Remember that our estimator is itself a random variable because it is a function of random variables: the data itself. That is, we defined the estimator as $\\widehat{\\theta}_n = \\theta(X_1, \\ldots, X_n)$. 
\n\n::: {#exm-three-dist}\n## Likert responses\n\nSuppose $X_i$ is the answer to a question, \"How much do you agree with the following statement: Immigrants are a net positive for the United States,\" with $X_i = 0$ being \"strongly disagree,\" $X_i = 1$ being \"disagree,\" $X_i = 2$ being \"neither agree nor disagree,\" $X_i = 3$ being \"agree,\" and $X_i = 4$ being \"strongly agree.\"\n\nThe population distribution describes the probability of randomly selecting a person with each one of these values, $\\P(X_i = x)$. The empirical distribution would be the fraction of our data taking each value. And the sampling distribution of the sample mean, $\\Xbar_n$, would be the distribution of the sample mean across repeated samples from the population. \n\nSuppose the population distribution was binomial with four trials and probability of success $p = 0.4$. We could generate one sample with $n = 10$ and thus one empirical distribution using `rbinom()`:\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_samp <- rbinom(n = 10, size = 4, prob = 0.4)\nmy_samp\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 3 2 3 4 0 2 3 2 2\n```\n:::\n\n```{.r .cell-code}\ntable(my_samp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nmy_samp\n0 1 2 3 4 \n1 1 4 3 1 \n```\n:::\n:::\n\n\n\nAnd we can generate one draw from the sampling distribution of $\\Xbar_n$ by taking the mean of this sample:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(my_samp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2.2\n```\n:::\n:::\n\n\n \nBut, if we had a different sample, it would have a different empirical distribution and thus give us a different estimate of the sample mean:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_samp2 <- rbinom(n = 10, size = 4, prob = 0.4)\nmean(my_samp2) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2\n```\n:::\n:::\n\n\n\nThe sampling distribution is the distribution of these sample means across repeated sampling.\n 
\n\n:::\n\n\n## Finite-sample properties of estimators\n\nAs we discussed when we introduced estimators, their usefulness depends on how well they help us learn about the quantity of interest. If we get an estimate $\\widehat{\\theta} = 1.6$, we would like to know that this is \"close\" to the true parameter $\\theta$. The sampling distribution is the key to answering these questions. Intuitively, we would like the sampling distribution of $\\widehat{\\theta}_n$ to be as tightly clustered around the true $\\theta$ as possible. Here, though, we run into a problem: the sampling distribution depends on the population distribution since it is about repeated samples of the data from that distribution filtered through the function $\\theta()$. Since $F$ is unknown, this implies that the sampling distribution will also usually be unknown. \n\nEven though we cannot precisely pin down the entire sampling distribution, we can use assumptions to derive specific properties of the sampling distribution that will be useful in comparing estimators. \n\n\n### Bias\n\nThe first property of the sampling distribution concerns its central tendency. In particular, we will define the **bias** (or **estimation bias**) of estimator $\\widehat{\\theta}$ for parameter $\\theta$ as\n$$\n\\text{bias}[\\widehat{\\theta}] = \\E[\\widehat{\\theta}] - \\theta,\n$$\nwhich is the difference between the mean of the estimator (across repeated samples) and the true parameter. All else equal, we would like estimation bias to be as small as possible. The smallest possible bias, obviously, is 0, and we define an **unbiased estimator** as one with $\\text{bias}[\\widehat{\\theta}] = 0$ or equivalently, $\\E[\\widehat{\\theta}] = \\theta$. \n\nHowever, all else is not always equal, and unbiasedness is not a property to become overly attached to. Many biased estimators have other attractive properties, and many popular modern estimators are biased. 
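To see estimation bias concretely, here is a quick simulation sketch (in Python with NumPy, with a normal DGP and parameter values chosen for illustration; not from the text). The plug-in variance estimator, which divides by $n$, has expectation $\sigma^2 (n-1)/n$, so its bias is $-\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, reps = 4.0, 10, 20000  # true variance 4, small samples of size 10

# Draw many iid samples and compute the plug-in variance (divide by n) of each.
samples = rng.normal(0.0, 2.0, size=(reps, n))
ests = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

# The mean across repeated samples approximates E[sigma2_hat]:
# about sigma^2 * (n - 1) / n = 3.6 rather than the true 4.0.
print(ests.mean())
```

The downward bias of roughly $\sigma^2/n$ vanishes as $n$ grows, which is why the $n$ versus $n-1$ distinction only matters in small samples.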
\n\n::: {#exm-mean-unbiased}\n\n## Unbiasedness of the sample mean\nWe can show that the sample mean is unbiased for the population mean when the data is iid and $\\E|X| < \\infty$. In particular, we simply apply the rules of expectations:\n$$\\begin{aligned}\n\\E\\left[ \\Xbar_n \\right] &= \\E\\left[\\frac{1}{n} \\sum_{i=1}^n X_i\\right] & (\\text{definition of } \\Xbar_n) \\\\\n&= \\frac{1}{n} \\sum_{i=1}^n \\E[X_i] & (\\text{linearity of } \\E)\\\\\n&= \\frac{1}{n} \\sum_{i=1}^n \\mu & (X_i \\text{ identically distributed})\\\\\n&= \\mu.\n\\end{aligned}$$\nNotice that we only used the \"identically distributed\" part of iid. Independence is not needed. \n\n:::\n\n::: {.callout-warning}\n\nProperties like unbiasedness might only hold for a subset of DGPs. For example, we just showed that the sample mean is unbiased, but only when the population mean is finite. There are probability distributions like the Cauchy where the expected value diverges and is not finite. So we are dealing with a restricted class of DGPs that rules out such distributions. You may see this sometimes formalized by defining a class $\\mathcal{F}$ of distributions, and unbiasedness might hold in that class if it is unbiased for all $F \\in \\mathcal{F}$. \n\n:::\n\n\n### Estimation variance and the standard error\n\nIf a \"good\" estimator tends to be close to the truth, we should also care about the spread of the sampling distribution. In particular, we define the **sampling variance** as the variance of an estimator's sampling distribution, $\\V[\\widehat{\\theta}]$, which measures how spread out the estimator is around its mean. For an unbiased estimator, lower sampling variance implies the distribution of $\\widehat{\\theta}$ is more concentrated around the truth. 
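A simulation sketch of the sampling variance of the sample mean (Python with NumPy, normal DGP and seed assumed for illustration; not from the text), showing that the sampling distribution of $\Xbar_n$ concentrates around the truth as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, reps = 3.0, 40000

# Sampling distribution of the sample mean for two sample sizes:
# draw many iid samples and compute the mean of each one.
for n in (25, 100):
    xbars = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
    print(n, xbars.var())  # close to sigma^2 / n
```

Quadrupling the sample size cuts the sampling variance by a factor of four, so the standard error shrinks at the familiar $1/\sqrt{n}$ rate.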
\n\n\n::: {#exm-mean-var}\n\n## Sampling variance of the sample mean\n\nWe can establish the sampling variance of the sample mean of iid data for all $F$ such that $\\V[X_i]$ is finite (more precisely, $\\E[X_i^2] < \\infty$)\n\n$$\\begin{aligned}\n \\V\\left[ \\Xbar_n \\right] &= \\V\\left[ \\frac{1}{n} \\sum_{i=1}^n X_i \\right] & (\\text{definition of } \\Xbar_n) \\\\\n &= \\frac{1}{n^2} \\V\\left[ \\sum_{i=1}^n X_i \\right] & (\\text{property of } \\V) \\\\\n &= \\frac{1}{n^2} \\sum_{i=1}^n \\V[X_i] & (\\text{independence}) \\\\\n &= \\frac{1}{n^2} \\sum_{i=1}^n \\sigma^2 & (X_i \\text{ identically distributed}) \\\\\n &= \\frac{\\sigma^2}{n}\n\\end{aligned}$$\n\n:::\n\nAn alternative measure of spread for any distribution is the standard deviation, which is on the same scale as the original random variable. We call the standard deviation of the sampling distribution of $\\widehat{\\theta}$ the **standard error** of $\\widehat{\\theta}$: $\\se(\\widehat{\\theta}) = \\sqrt{\\V[\\widehat{\\theta}]}$. \n\nGiven the above derivation, the standard error of the sample mean under iid sampling is $\\sigma / \\sqrt{n}$. \n\n\n### Mean squared error\n\nBias and sampling variance measure two different aspects of being a \"good\" estimator. Ideally, we want the estimator to be as close as possible to the true value. One summary measure of the quality of an estimator is the **mean squared error** or **MSE**, which is \n$$\n\\text{MSE} = \\E[(\\widehat{\\theta}_n-\\theta)^2].\n$$\nIdeally, we would have this be as small as possible!\n\nWe can also relate the MSE to the bias and the sampling variance (provided it is finite) with the following decomposition result: \n$$\n\\text{MSE} = \\text{bias}[\\widehat{\\theta}_n]^2 + \\V[\\widehat{\\theta}_n]\n$$\nThis decomposition implies that, for unbiased estimators, MSE is the sampling variance. It also highlights why we might accept some bias for significant reductions in variance for lower overall MSE. 
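The MSE decomposition is easy to verify numerically. A sketch (Python with NumPy; the shrinkage estimator, DGP, and seed are assumptions for illustration, not from the text) using a deliberately biased estimator that shrinks the sample mean toward zero:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, n, reps = 2.0, 20, 50000

# A biased estimator: 0.9 times the sample mean. Its bias is -0.1 * mu.
ests = 0.9 * rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)

mse = np.mean((ests - mu) ** 2)          # average squared error
bias_sq = (ests.mean() - mu) ** 2        # squared bias
var = ests.var()                         # sampling variance

# MSE equals squared bias plus sampling variance.
print(mse, bias_sq + var)
```

Here the shrinkage trades a small squared bias for lower variance, the same bargain the figure's $\widehat{\theta}_b$ makes.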
# Estimation\n\n## Introduction\n\nWhen studying probability, we assumed that we knew the parameter of a distribution (the mean or the variance) and used probability theory to understand what kind of data we would observe. Estimation and inference put this engine in reverse and try to learn some aspect of the data-generating process using only our observed data. There are two main goals here: **estimation**, which is how we formulate our best guess about a parameter of the DGP, and **inference**, which is how we formalize and express uncertainty about our estimates. \n\n![](assets/img/two-direction.png)\n\n::: {#exm-rct}\n\n## Randomized control trial\n\nSuppose we are conducting a randomized experiment on framing effects. All respondents receive some factual information about current levels of immigration. The message for the treatment group ($D_i = 1$) has an additional framing of the positive benefits of immigration, while the control group ($D_i = 0$) receives no additional framing. The outcome is binary: whether the respondent supports increasing legal immigration limits ($Y_i = 1$) or not ($Y_i = 0$).
The observed data consists of $n$ pairs of random variables, the outcome, and the treatment assignment: $\\{(Y_1, D_1), \\ldots, (Y_n, D_n)\\}$. Define the two sample means/proportions in each group as \n$$\n\\Ybar_1 = \\frac{1}{n_1} \\sum_{i: D_i = 1} Y_i, \\qquad\\qquad \\Ybar_0 = \\frac{1}{n_0} \\sum_{i: D_i = 0} Y_i,\n$$\nwhere $n_1 = \\sum_{i=1}^n D_i$ is the number of treated units and $n_0 = n - n_1$ is the number of control units. \n\nA standard estimator for the treatment effect in a study like this would be the difference in means, $\\Ybar_1 - \\Ybar_0$. But this is only one possible estimator. We could also estimate the effect by taking this difference in means separately by party identification and then averaging those party-specific effects by the size of those groups. This estimator is commonly called a **poststratification** estimator, but it's unclear at first glance which of these two estimators we should prefer. \n\n:::\n\n\nWhat are the goals of studying estimators? In short, we prefer to use **good** estimators rather than **bad** estimators. But what makes an estimator good or bad? You probably have some intuitive sense that, for example, an estimator that always returns the value 3 is bad. Still, it will be helpful for us to formally define and explore some properties of estimators that will allow us to compare them and choose the good over the bad. \n\n\n## Samples and populations\n\n\n\n\nFor most of this class, we'll focus on a relatively simple setting where we have a set of random vectors $\\{X_1, \\ldots, X_n\\}$ that are **independent and identically distributed** (iid) draws from a distribution with cumulative distribution function (cdf) $F$. 
They are independent in that information about any subset of random vectors is not informative about any other subset of random vectors, or, more formally, \n$$\nF_{X_{1},\\ldots,X_{n}}(x_{1}, \\ldots, x_{n}) = F_{X_{1}}(x_{1})\\cdots F_{X_{n}}(x_{n}),\n$$\nwhere $F_{X_{1},\\ldots,X_{n}}(x_{1}, \\ldots, x_{n})$ is the joint cdf of the random vectors and $F_{X_{j}}(x_{j})$ is the marginal cdf of the $j$th random vector. They are \"identically distributed\" in the sense that each of the random variables $X_i$ has the same marginal distribution, $F$.\n\n[^model]: This approach to inference is often called a **model-based approach** since we are assuming a probability model in the cdf, $F$. This is usually in contrast to a **design-based approach** to inference that views the population of interest as a finite group with fixed traits, where the only randomness comes from the random sampling procedure. \n\nYou can think of each vector, $X_i$, as a row in your data frame. Note that we're being purposely vague about this cdf---it simply represents the unknown distribution of the data, otherwise known as the **data generating process** (DGP). Sometimes $F$ is also referred to as the **population distribution** or even just **population**, which has its roots in viewing the data as a random sample from some larger population.[^model] As a shorthand, we often say that the collection of random vectors $\\{X_1, \\ldots, X_n\\}$ is a **random sample** from population $F$ if $\\{X_1, \\ldots, X_n\\}$ is iid with distribution $F$. The **sample size** $n$ is the number of units in the sample. \n\n\nTwo metaphors can help build intuition about the concept of viewing the data as an iid draw from $F$:\n\n1. **Random sampling**. Suppose we have a population of size $N$ that is much larger than our sample size $n$, and we take a random sample of size $n$ from this population with replacement.
Then the distribution of the data in the random sample will be iid draws from the population distribution of the variables we are sampling. For instance, suppose we take a random sample from a population of US citizens where the population proportion of Democratic party identifiers is 0.33. Then if we randomly sample $n = 100$ US citizens, each data point $X_i$ will be distributed Bernoulli with probability of success 0.33. \n2. **Groundhog Day**. Random sampling does not always make sense as a justification for iid data, especially when the units are not samples at all but rather countries, states, or subnational units. In this case, we have to appeal to a thought experiment where $F$ represents the fundamental uncertainty in the data-generating process. The metaphor here is that if we could re-run history many times, like the 1993 American classic comedy *Groundhog Day*, data and outcomes would change slightly due to the inherently stochastic nature of the world. The iid assumption, then, is that each of the units in our data has the same DGP producing this data or the same distribution of outcomes under the *Groundhog Day* scenario. The set of all these infinite possible draws from the DGP is sometimes referred to as the **superpopulation**. \n\nNote that there are many situations where the iid assumption is not appropriate. We will cover some of those later in the semester. But much of the innovation and growth in statistics over the last 50 years has been figuring out how to perform statistical inference when iid does not hold. Often, the solutions are specific to the type of iid violation you have (spatial, time-series, network, or clustered). As a rule of thumb, though, if you suspect iid is incorrect, your uncertainty statements will likely be overconfident (for example, confidence intervals, which we'll cover later, are too small). 
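The random-sampling metaphor is easy to check by simulation. Here is a small sketch in R using the 0.33 Democratic share from the example above (the population size and seed are chosen for illustration):

```r
set.seed(1234)  # illustrative seed
# A large finite population where 33% identify as Democrats (coded 1)
N <- 1e6
population <- rbinom(N, size = 1, prob = 0.33)

# Sampling n = 100 units with replacement yields (essentially) iid
# Bernoulli(0.33) draws, so the sample proportion should be near 0.33
my_sample <- sample(population, size = 100, replace = TRUE)
mean(my_sample)
```

Each run of `sample()` produces a different draw from the same DGP, which is exactly the repeated-sampling thought experiment.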
\n\n## Point estimation\n\n### Quantities of interest\n\nWe aim to learn about the data-generating process, represented by the cdf, $F$. We might be interested in estimating the cdf at a general level or only some feature of the distribution, like a mean or conditional expectation function. We will almost always have a particular quantity in mind, but we'll introduce estimation at a general level. So we'll let $\\theta$ represent the quantity of interest. **Point estimation** describes how we obtain a single \"best guess\" about $\\theta$.\n\n::: {.callout-note}\n\nSome refer to quantities of interest as **parameters** or **estimands** (that is, the target of estimation).\n\n:::\n\n\n::: {#exm-prop}\n\n## Population mean\n\nSuppose we wanted to know the proportion of US citizens who support increasing legal immigration in the US, which we denote as $Y_i = 1$. Then our quantity of interest is the mean of this random variable, $\\mu = \\E[Y_i]$, which is the probability of randomly drawing someone from the population supporting increased legal immigration. \n\n:::\n\n::: {#exm-var}\n\n## Population variance\n\nFeeling thermometer scores are a prevalent way to assess how a survey respondent feels about a particular person or group. A survey asks respondents how warmly they feel about a group from 0 to 100, which we will denote $Y_i$. We might be interested in how polarized views are on a group in the population, and one measure of polarization could be the variance, or spread, of the distribution of $Y_i$ around the mean. In this case, $\\sigma^2 = \\V[Y_i]$ would be our quantity of interest. \n\n:::\n\n\n::: {#exm-rct-ii}\n\n## RCT continued\n\nIn @exm-rct, we discussed a typical estimator for an experimental study with a binary treatment. 
The goal of that experiment is to learn about the difference between two conditional probabilities (or expectations): the average support for increasing legal immigration in the treatment group, $\\mu_1 = \\E[Y_i \\mid D_i = 1]$, and the same average in the control group, $\\mu_0 = \\E[Y_i \\mid D_i = 0]$. This difference, $\\mu_1 - \\mu_0$, is a function of unknown features of these two conditional distributions. \n\n:::\n\nEach of these is a function of the (possibly joint) distribution of the data, $F$. In each of these, we are not necessarily interested in the entire distribution, just summaries of it (central tendency, spread). Of course, there are situations where we are also interested in the complete distribution. \n\n### Estimators \n\nWhen our sample size is more than a few observations, it makes no sense to work with the raw data, $X_1, \\ldots, X_n$, and we inevitably will need to *summarize* the data in some way. We can represent this summary as a function, $g(x_1, \\ldots, x_n)$, which might be the formula for the sample mean or sample variance. This function is just a regular function that takes in $n$ numbers (or vectors) and returns a number (or vector). We can also define a random variable based on this function, $Y = g(X_1, \\ldots, X_n)$, which inherits its randomness from the randomness of the data. Before we see the data, we don't know what values of $X_1, \\ldots, X_n$ we will see, so we don't know what value of $Y$ we'll see either. We call the random variable $Y = g(X_1, \\ldots, X_n)$ a **statistic** (or sometimes sample statistics), and we refer to the probability distribution of a statistic $Y$ as the **sampling distribution** of $Y$. \n\n\n::: {.callout-warning}\n\nThere is one potential confusion in how we talk about \"statistics.\" Just above, we defined a statistic as a random variable based on it being a function of random variables (the data). 
But we sometimes refer to the calculated value as a statistic as well, which is a specific number that you see in your R output. To be precise, we should call the latter the **realized value** of the statistic, but message discipline is difficult to enforce in this context. A simple example might help. Suppose that $X_1$ and $X_2$ are the results of a roll of two standard six-sided dice. Then the statistic $Y = X_1 + X_2$ is a random variable that has a distribution over the numbers $\\{2, \\ldots, 12\\}$ that describes our uncertainty over what the sum will be *before we roll the dice*. Once we have rolled the dice and observed the realized values $X_1 = 3$ and $X_2 = 4$, we observe the realized value of the statistic, $Y = 7$. \n\n:::\n\nAt their most basic, statistics are just data summaries without aim or ambition. Estimators are statistics with a purpose: to provide an \"educated guess\" about some quantity of interest. \n\n::: {#def-estimator}\nAn **estimator** $\\widehat{\\theta}_n = \\theta(X_1, \\ldots, X_n)$ for some parameter $\\theta$ is a statistic intended as a guess about $\\theta$.\n:::\n\nOne important distinction of jargon is between an estimator and an estimate, similar to the issues with \"statistic\" described above. The estimator is a function of the data, whereas the **estimate** is the *realized value* of the estimator once we see the data. An estimate is a single number, such as 0.38, whereas the estimator is a random variable that has uncertainty over what value it will take. Formally, the estimate is $\\theta(x_1, \\ldots, x_n)$ when the data is $\\{X_1, \\ldots, X_n\\} = \\{x_1, \\ldots, x_n\\}$, whereas we represent the estimator as a function of random variables, $\\widehat{\\theta}_n = \\theta(X_1, \\ldots, X_n)$. \n\n::: {.callout-note}\n\nIt is widespread, though not universal, to use the \"hat\" notation to define an estimator and its estimand.
For example, $\\widehat{\\theta}$ (or \"theta hat\") indicates that this estimator is targeting the parameter $\\theta$. \n\n:::\n\n::: {#exm-mean-est}\n\n## Estimators for the population mean\n\nSuppose we would like to estimate the population mean of $F$, which we will represent as $\\mu = \\E[X_i]$. We could choose from several estimators, all with different properties. \n$$\n\\widehat{\\theta}_{n,1} = \\frac{1}{n} \\sum_{i=1}^n X_i, \\quad \\widehat{\\theta}_{n,2} = X_1, \\quad \\widehat{\\theta}_{n,3} = \\text{max}(X_1,\\ldots,X_n), \\quad \\widehat{\\theta}_{n,4} = 3\n$$\nThe first is just the sample mean, which is an intuitive and natural estimator for the population mean. The second just uses the first observation. While this seems silly, this is a valid statistic (it's a function of the data!). The third takes the maximum value in the sample, and the fourth always returns three, regardless of the data. \n:::\n\n## How to find estimators\n\nWhere do estimators come from? There are a couple of different methods that I'll cover briefly here before describing the ones that will form the bulk of this class. \n\n### Parametric models and maximum likelihood \n\nThe first method for generating estimators relies on **parametric models**, where the researcher specifies the exact distribution (up to some unknown parameters) of the DGP. Let $\\theta$ be the parameters of this distribution and we then write $\\{X_1, \\ldots, X_n\\}$ are iid draws from $F_{\\theta}$. We should also formally state the set of possible values the parameters can take, which we call the **parameter space** and usually denote as $\\Theta$. Because we're assuming we know the distribution of the data, we can write the p.d.f. 
as $f(X_i \\mid \\theta)$ and define the likelihood function as the product of these p.d.f.s over the units as a function of the parameters:\n$$\nL(\\theta) = \\prod_{i=1}^n f(X_i \\mid \\theta).\n$$\nWe can then define the **maximum likelihood** estimator (MLE) for $\\theta$ as the values of the parameter that, well, maximize the likelihood:\n$$\n\\widehat{\\theta}_{mle} = \\argmax_{\\theta \\in \\Theta} \\; L(\\theta)\n$$\nSometimes we can use calculus to derive a closed-form expression for the MLE. Still, we often use iterative techniques that search the parameter space for the maximum. \n\nMaximum likelihood estimators have very nice properties, especially in large samples. Unfortunately, they also require the correct knowledge of the parametric model, which is often difficult to justify. Do we really know if we should model a given event count variable as Poisson or Negative Binomial? The attractive properties of MLE are only as good as our ability to specify the parametric model. \n\n::: {.callout-note}\n\n## No free lunch\n\nOne essential intuition to build about statistics is the **assumptions-precision tradeoff**. You can usually get more precise estimates if you make stronger and potentially more fragile assumptions. Conversely, you will almost always get less accurate estimates if you weaken your assumptions.\n\n:::\n\n### Plug-in estimators\n\nThe second broad class of estimators is **semiparametric** in that we will specify some finite-dimensional parameters of the DGP but leave the rest of the distribution unspecified. For example, we might define a population mean, $\\mu = \\E[X_i]$, and a population variance, $\\sigma^2 = \\V[X_i]$ but leave unrestricted the shape of the distribution. This approach ensures that our estimators will be less dependent on correctly specifying distributions we have little intuition about. 
\n\nThe primary method for constructing estimators in this setting is to use the **plug-in estimator**, or the estimator that replaces any population mean with a sample mean. Obviously, in the case of estimating the population mean, $\\mu$, this means we will use the **sample mean** as its estimate:\n$$\n\\Xbar_n = \\frac{1}{n} \\sum_{i=1}^n X_i \\quad \\text{estimates} \\quad \\E[X_i] = \\int_{\\mathcal{X}} x f(x)dx\n$$\nWhat are we doing here? We are replacing the unknown population distribution $f(x)$ in the population mean with a discrete uniform distribution over our data points, with $1/n$ probability assigned to each unit. Why do this? It encodes that if we have a random sample, our best guess about the population distribution of $X_i$ is the sample distribution in our actual data. If this intuition fails you, you can hold onto the analog principle: sample means of random variables are natural estimators of population means. \n\nWhat about estimating something more complicated, like the expected value of a function of the data, $\\theta = \\E[r(X_i)]$? The key is to see that $r(X_i)$ is also a random variable. Let's call this random variable $Y_i = r(X_i)$. Now we can see that $\\theta$ is just the population expectation of this random variable, and using the plug-in estimator, we get:\n$$\n\\widehat{\\theta} = \\frac{1}{n} \\sum_{i=1}^n Y_i = \\frac{1}{n} \\sum_{i=1}^n r(X_i). \n$$\n\nWith these facts in hand, we can describe the more general plug-in estimator. When we want to estimate some quantity of interest that is a function of population means, we can generate a plug-in estimator by replacing any population mean with a sample mean. Formally, let $\\alpha = g\\left(\\E[r(X_i)]\\right)$ be a parameter that is defined as a function of the population mean of a (possibly vector-valued) function of the data.
Then, we can estimate this parameter by plugging in the sample mean for the population mean to get the **plug-in estimator**,\n$$\n\\widehat{\\alpha} = g\\left( \\frac{1}{n} \\sum_{i=1}^n r(X_i) \\right) \\quad \\text{estimates} \\quad \\alpha = g\\left(\\E[r(X_i)]\\right)\n$$\nThis approach to plug-in estimation with sample means is very general and will allow us to derive estimators in various settings. \n\n::: {#exm-var-est}\n\n## Estimating population variance\n\nThe population variance of a random variable is $\\sigma^2 = \\E[(X_i - \\E[X_i])^2]$. To derive a plug-in estimator for this quantity, we replace the inner $\\E[X_i]$ with $\\Xbar_n$ and the outer expectation with another sample mean:\n$$\n\\widehat{\\sigma}^2 = \\frac{1}{n} \\sum_{i=1}^n (X_i - \\Xbar_n)^2.\n$$\nThis plug-in estimator differs from the standard sample variance, which divides by $n - 1$ rather than $n$. This minor difference does not matter in moderate to large samples.\n:::\n\n::: {#exm-cov-est}\n\n## Estimating population covariance\n\nSuppose we have two variables, $(X_i, Y_i)$. A natural quantity of interest here is the population covariance between these variables, \n$$\n\\sigma_{xy} = \\text{Cov}[X_i,Y_i] = \\E[(X_i - \\E[X_i])(Y_i-\\E[Y_i])],\n$$\nwhich has the plug-in estimator,\n$$\n\\widehat{\\sigma}_{xy} = \\frac{1}{n} \\sum_{i=1}^n (X_i - \\Xbar_n)(Y_i - \\Ybar_n).\n$$\n:::\n\n::: {.callout-note}\n\n## Notation alert\n\nGiven the connection between the population mean and the sample mean, you will sometimes see the $\\E_n[\\cdot]$ operator used as a shorthand for the sample average:\n$$\n\\E_n[r(X_i)] \\equiv \\frac{1}{n} \\sum_{i=1}^n r(X_i).\n$$\n\n:::\n\nFinally, plug-in estimation goes beyond just replacing population means with sample means. We can derive estimators of the population quantiles like the median with sample versions of those quantities. 
What unifies all of these approaches is replacing the unknown population cdf, $F$, with the empirical cdf, \n$$\n\\widehat{F}_n(x) = \\frac{\\sum_{i=1}^n \\mathbb{I}(X_i \\leq x)}{n},\n$$\nwhere $\\mathbb{I}(A)$ is an *indicator function* that takes the value 1 if the event $A$ occurs and 0 otherwise. For a more complete and technical treatment of these ideas, see Wasserman (2004) Chapter 7.\n\n\n\n## The three distributions: population, empirical, and sampling\n\nOnce we start to wade into estimation, there are several distributions to keep track of, and things can quickly become confusing. Three specific distributions are all related and easy to confuse, but keeping them distinct is crucial. \n\nThe **population distribution** is the distribution of the random variable, $X_i$, which we have labeled $F$ and is our target of inference. Then there is the **empirical distribution**, which is the distribution of the actual realizations of the random variables in our sample (that is, the numbers in our data frame), $X_1, \\ldots, X_n$. Because this is a random sample from the population distribution and can serve as an estimator of $F$, we sometimes call this $\\widehat{F}_n$. \n\nSeparately from both is the **sampling distribution of an estimator**, which is the probability distribution of $\\widehat{\\theta}_n$. It represents our uncertainty about our estimate before we see the data. Remember that our estimator is itself a random variable because it is a function of random variables: the data itself. That is, we defined the estimator as $\\widehat{\\theta}_n = \\theta(X_1, \\ldots, X_n)$.
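The plug-in principle and the empirical cdf are both one-liners in R. A brief sketch, with simulated data chosen purely for illustration:

```r
set.seed(42)  # illustrative
x <- rnorm(100, mean = 1, sd = 2)
n <- length(x)

# Plug-in estimator of the population variance: divide by n
sigma2_hat <- mean((x - mean(x))^2)

# The usual sample variance divides by n - 1, so the two differ
# only by the factor (n - 1) / n
all.equal(sigma2_hat, var(x) * (n - 1) / n)  # TRUE

# The empirical cdf at a point is the share of the data at or
# below that point; ecdf() builds the same step function
mean(x <= 1)
ecdf(x)(1)  # same value
```

As the note above suggests, `mean(x <= 1)` is just $\E_n[\mathbb{I}(X_i \leq 1)]$, a sample mean standing in for a population mean.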
\n\n::: {#exm-three-dist}\n## Likert responses\n\nSuppose $X_i$ is the answer to a question, \"How much do you agree with the following statement: Immigrants are a net positive for the United States,\" with $X_i = 0$ being \"strongly disagree,\" $X_i = 1$ being \"disagree,\" $X_i = 2$ being \"neither agree nor disagree,\" $X_i = 3$ being \"agree,\" and $X_i = 4$ being \"strongly agree.\"\n\nThe population distribution describes the probability of randomly selecting a person with each one of these values, $\\P(X_i = x)$. The empirical distribution would be the fraction of our data taking each value. And the sampling distribution of the sample mean, $\\Xbar_n$, would be the distribution of the sample mean across repeated samples from the population. \n\nSuppose the population distribution were binomial with four trials and probability of success $p = 0.4$. We could generate one sample with $n = 10$ and thus one empirical distribution using `rbinom()`:\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_samp <- rbinom(n = 10, size = 4, prob = 0.4)\nmy_samp\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 3 2 3 4 0 2 3 2 2\n```\n:::\n\n```{.r .cell-code}\ntable(my_samp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nmy_samp\n0 1 2 3 4 \n1 1 4 3 1 \n```\n:::\n:::\n\n\n\nAnd we can generate one draw from the sampling distribution of $\\Xbar_n$ by taking the mean of this sample:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(my_samp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2.2\n```\n:::\n:::\n\n\n \nBut, if we had a different sample, it would have a different empirical distribution and thus give us a different estimate of the sample mean:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_samp2 <- rbinom(n = 10, size = 4, prob = 0.4)\nmean(my_samp2) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2\n```\n:::\n:::\n\n\n\nThe sampling distribution is the distribution of these sample means across repeated sampling.\n 
\n\n:::\n\n\n## Finite-sample properties of estimators\n\nAs we discussed when we introduced estimators, their usefulness depends on how well they help us learn about the quantity of interest. If we get an estimate $\\widehat{\\theta} = 1.6$, we would like to know that this is \"close\" to the true parameter $\\theta$. The sampling distribution is the key to answering these questions. Intuitively, we would like the sampling distribution of $\\widehat{\\theta}_n$ to be as tightly clustered around the true $\\theta$ as possible. Here, though, we run into a problem: the sampling distribution depends on the population distribution since it is about repeated samples of the data from that distribution filtered through the function $\\theta()$. Since $F$ is unknown, this implies that the sampling distribution will also usually be unknown. \n\nEven though we cannot precisely pin down the entire sampling distribution, we can use assumptions to derive specific properties of the sampling distribution that will be useful in comparing estimators. \n\n\n### Bias\n\nThe first property of the sampling distribution concerns its central tendency. In particular, we will define the **bias** (or **estimation bias**) of estimator $\\widehat{\\theta}$ for parameter $\\theta$ as\n$$\n\\text{bias}[\\widehat{\\theta}] = \\E[\\widehat{\\theta}] - \\theta,\n$$\nwhich is the difference between the mean of the estimator (across repeated samples) and the true parameter. All else equal, we would like estimation bias to be as small as possible. The smallest possible bias, obviously, is 0, and we define an **unbiased estimator** as one with $\\text{bias}[\\widehat{\\theta}] = 0$ or, equivalently, $\\E[\\widehat{\\theta}] = \\theta$. \n\nHowever, all else is not always equal, and unbiasedness is not a property to become overly attached to. Many biased estimators have other attractive properties, and many popular modern estimators are biased.
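Sampling distributions are easy to approximate by simulation, which also lets us check bias numerically. A sketch continuing the binomial Likert setting from above, where the true mean is $4 \times 0.4 = 1.6$ (the replication count is arbitrary):

```r
set.seed(2138)  # illustrative seed
# Approximate the sampling distribution of the sample mean for
# n = 10 draws from Binomial(4, 0.4), whose true mean is 1.6
xbars <- replicate(10000, mean(rbinom(n = 10, size = 4, prob = 0.4)))

# The average of the simulated sample means sits very close to 1.6,
# consistent with the sample mean being unbiased
mean(xbars)
```

A histogram of `xbars` would be a picture of (an approximation to) the sampling distribution itself.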
\n\n::: {#exm-mean-unbiased}\n\n## Unbiasedness of the sample mean\nWe can show that the sample mean is unbiased for the population mean when the data is iid and $\\E|X| < \\infty$. In particular, we simply apply the rules of expectations:\n$$\\begin{aligned}\n\\E\\left[ \\Xbar_n \\right] &= \\E\\left[\\frac{1}{n} \\sum_{i=1}^n X_i\\right] & (\\text{definition of } \\Xbar_n) \\\\\n&= \\frac{1}{n} \\sum_{i=1}^n \\E[X_i] & (\\text{linearity of } \\E)\\\\\n&= \\frac{1}{n} \\sum_{i=1}^n \\mu & (X_i \\text{ identically distributed})\\\\\n&= \\mu.\n\\end{aligned}$$\nNotice that we only used the \"identically distributed\" part of iid. Independence is not needed. \n\n:::\n\n::: {.callout-warning}\n\nProperties like unbiasedness might only hold for a subset of DGPs. For example, we just showed that the sample mean is unbiased, but only when the population mean is finite. There are probability distributions like the Cauchy where the expected value diverges and is not finite. So we are dealing with a restricted class of DGPs that rules out such distributions. You may see this sometimes formalized by defining a class $\\mathcal{F}$ of distributions, and unbiasedness might hold in that class if it is unbiased for all $F \\in \\mathcal{F}$. \n\n:::\n\n\n### Estimation variance and the standard error\n\nIf a \"good\" estimator tends to be close to the truth, we should also care about the spread of the sampling distribution. In particular, we define the **sampling variance** as the variance of an estimator's sampling distribution, $\\V[\\widehat{\\theta}]$, which measures how spread out the estimator is around its mean. For an unbiased estimator, lower sampling variance implies the distribution of $\\widehat{\\theta}$ is more concentrated around the truth. 
\n\n\n::: {#exm-mean-var}\n\n## Sampling variance of the sample mean\n\nWe can establish the sampling variance of the sample mean of iid data for all $F$ such that $\\V[X_i]$ is finite (more precisely, $\\E[X_i^2] < \\infty$)\n\n$$\\begin{aligned}\n \\V\\left[ \\Xbar_n \\right] &= \\V\\left[ \\frac{1}{n} \\sum_{i=1}^n X_i \\right] & (\\text{definition of } \\Xbar_n) \\\\\n &= \\frac{1}{n^2} \\V\\left[ \\sum_{i=1}^n X_i \\right] & (\\text{property of } \\V) \\\\\n &= \\frac{1}{n^2} \\sum_{i=1}^n \\V[X_i] & (\\text{independence}) \\\\\n &= \\frac{1}{n^2} \\sum_{i=1}^n \\sigma^2 & (X_i \\text{ identically distributed}) \\\\\n &= \\frac{\\sigma^2}{n}\n\\end{aligned}$$\n\n:::\n\nAn alternative measure of spread for any distribution is the standard deviation, which is on the same scale as the original random variable. We call the standard deviation of the sampling distribution of $\\widehat{\\theta}$ the **standard error** of $\\widehat{\\theta}$: $\\se(\\widehat{\\theta}) = \\sqrt{\\V[\\widehat{\\theta}]}$. \n\nGiven the above derivation, the standard error of the sample mean under iid sampling is $\\sigma / \\sqrt{n}$. \n\n\n### Mean squared error\n\nBias and sampling variance measure two different aspects of being a \"good\" estimator. Ideally, we want the estimator to be as close as possible to the true value. One summary measure of the quality of an estimator is the **mean squared error** or **MSE**, which is \n$$\n\\text{MSE} = \\E[(\\widehat{\\theta}_n-\\theta)^2].\n$$\nIdeally, we would have this be as small as possible!\n\nWe can also relate the MSE to the bias and the sampling variance (provided it is finite) with the following decomposition result: \n$$\n\\text{MSE} = \\text{bias}[\\widehat{\\theta}_n]^2 + \\V[\\widehat{\\theta}_n]\n$$\nThis decomposition implies that, for unbiased estimators, MSE is the sampling variance. It also highlights why we might accept some bias for significant reductions in variance for lower overall MSE. 
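A quick simulation illustrates the tradeoff: a deliberately biased estimator can beat the unbiased sample mean on MSE. This sketch uses a hypothetical shrinkage estimator, $0.9 \times \Xbar_n$, with the true mean, sample size, and spread chosen for illustration:

```r
set.seed(2138)  # illustrative
theta <- 2  # true mean
sims <- replicate(10000, {
  x <- rnorm(20, mean = theta, sd = 3)
  c(unbiased = mean(x), shrunk = 0.9 * mean(x))
})

# MSE across repeated samples: the shrinkage estimator is biased
# (its expectation is 0.9 * theta = 1.8) but has lower variance,
# and here it wins on overall MSE
mse <- rowMeans((sims - theta)^2)
mse
```

The unbiased estimator's MSE is close to its sampling variance, $\sigma^2/n = 9/20 = 0.45$, exactly as the decomposition predicts.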
\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two sampling distributions](02_estimation_files/figure-pdf/mse-1.pdf)\n:::\n:::\n\n\n\nIn this figure, we show the sampling distributions of two estimators: $\\widehat{\\theta}_a$, which is unbiased (centered on the true value $\\theta$) but has high sampling variance, and $\\widehat{\\theta}_b$, which is slightly biased but has much lower sampling variance. Even though $\\widehat{\\theta}_b$ is biased, the probability of drawing a value close to the truth is higher than for $\\widehat{\\theta}_a$. This balancing between bias and variance is precisely what the MSE helps capture and, indeed, in this case, $\\text{MSE}[\\widehat{\\theta}_b] < \\text{MSE}[\\widehat{\\theta}_a]$. \n\n# Asymptotics\n\nBefore turning to sequences of random variables, recall the definition of a limit for a deterministic sequence of numbers. \n\n::: {#def-limit}\nA sequence of numbers $a_1, a_2, \\ldots$ has the **limit** $a$, written $\\lim_{n\\rightarrow\\infty} a_n = a$, if for every $\\epsilon > 0$ there is some $n_{\\epsilon} < \\infty$ such that for all $n \\geq n_{\\epsilon}$, $|a_n - a| \\leq \\epsilon$. \n:::\n\nWe say that $a_n$ **converges** to $a$ if $\\lim_{n\\rightarrow\\infty} a_n = a$. Basically, a sequence converges to a number if the sequence gets closer and closer to that number as the sequence goes on. \n\nCan we apply this same idea to sequences of random variables (like estimators)? Let's look at a few examples that help clarify why this might be difficult.[^wass] First, suppose we have a constant sequence with $a_n = a$ for all $n$. Then obviously $\\lim_{n\\rightarrow\\infty} a_n = a$. Now suppose we have a sequence of random variables, $X_1, X_2, \\ldots$, that are all independent with a standard normal distribution, $N(0,1)$.
From the analogy to the deterministic case, it is tempting to say that $X_n$ converges to $X \\sim N(0, 1)$, but notice that because they are all different random variables, $\\P(X_n = X) = 0$. Thus, we must be careful about saying how one variable converges to another variable. \n\nAnother example highlights subtle problems with a sequence of random variables converging to a single value. Suppose we have a sequence of random variables $X_1, X_2, \\ldots$ where $X_n \\sim N(0, 1/n)$. Clearly, the distribution of $X_n$ will concentrate around 0 for large values of $n$, so it is tempting to say that $X_n$ converges to 0. But notice that $\\P(X_n = 0) = 0$ because of the nature of continuous random variables. \n\n\n\n\n[^wass]: Due to Wasserman (2004), Chapter 5.\n\n\n## Convergence in probability and consistency\n\nThere are several different ways that a sequence of random variables can converge. The first type of convergence deals with sequences converging to a single value.[^inprob]\n\n[^inprob]: Technically, a sequence can also converge in probability to another random variable, but the use case of converging to a single number is much more common in evaluating estimators. \n\n::: {#def-inprob}\nA sequence of random variables, $X_1, X_2, \\ldots$, is said to **converge in probability** to a value $b$ if for every $\\varepsilon > 0$,\n$$\n\\P(|X_n - b| > \\varepsilon) \\rightarrow 0,\n$$\nas $n\\rightarrow \\infty$. We write this $X_n \\inprob b$. \n:::\n\nWith deterministic sequences, we said that $a_n$ converges to $a$ if it gets closer and closer to $a$ as $n$ gets bigger. For convergence in probability, the sequence of random variables converges to $b$ if the probability that the random variable is far away from $b$ gets smaller and smaller as $n$ gets big.
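We can watch convergence in probability happen by simulation: the sample mean of standard normals converges in probability to 0, so the probability of landing more than $\varepsilon = 0.1$ away should shrink as $n$ grows (the sample sizes and replication count below are illustrative):

```r
set.seed(2138)  # illustrative
eps <- 0.1

# Estimate P(|Xbar_n - 0| > eps) at several sample sizes
prob_far <- sapply(c(10, 100, 1000), function(n) {
  xbars <- replicate(5000, mean(rnorm(n)))
  mean(abs(xbars) > eps)
})
prob_far  # shrinks toward 0 as n grows
```

For any fixed $\varepsilon$, this probability heads to 0, which is exactly what $\Xbar_n \inprob 0$ requires.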
\n\n\n::: {.callout-note}\n\n## Notation alert\n\nYou will sometimes see convergence in probability written as $\\text{plim}(Z_n) = b$ when $Z_n \\inprob b$, where $\\text{plim}$ stands for \"probability limit.\"\n\n:::\n\n\nConvergence in probability is crucial for evaluating estimators. While we said that unbiasedness was not the be-all and end-all of properties of estimators, the following property is an essential and fundamental property that we would like all good estimators to have. \n\n::: {#def-consistency}\nAn estimator is **consistent** if $\\widehat{\\theta}_n \\inprob \\theta$. \n:::\n\nConsistency of an estimator implies that the sampling distribution of this estimator \"collapses\" on the true value as the sample size gets large. We say an estimator is inconsistent if it converges in probability to any other value, which is obviously a terrible property of an estimator. As the sample size gets large, the probability that an inconsistent estimator will be close to the truth will approach 0. \n\nWe can also define convergence in probability for a sequence of random vectors, $\\X_1, \\X_2, \\ldots$, where $\\X_i = (X_{i1}, \\ldots, X_{ik})$ is a random vector of length $k$. This sequence converges in probability to a vector $\\mb{b} = (b_1, \\ldots, b_k)$ if and only if each random variable in the vector converges to the corresponding element in $\\mb{b}$; that is, $X_{nj} \\inprob b_j$ for all $j = 1, \\ldots, k$. \n\n\n## Useful inequalities\n\nAt first glance, establishing an estimator's consistency will be difficult. How can we know if a distribution will collapse to a specific value without knowing the shape or family of the distribution? It turns out that there are certain relationships between the mean and variance of a random variable and certain probability statements that hold for all distributions (that have finite variance, at least). These relationships will be crucial to establishing results that do not depend on a specific distribution. 
\n\n\n::: {#thm-markov}\n\n## Markov Inequality\n\nFor any r.v. $X$ and any $\\delta >0$, \n$$\n\\P(|X| \\geq \\delta) \\leq \\frac{\\E[|X|]}{\\delta}.\n$$\n:::\n\n::: {.proof}\n\nNotice that we can let $Y = |X|/\\delta$ and rewrite the statement as $\\P(Y \\geq 1) \\leq \\E[Y]$ (since $\\E[|X|]/\\delta = \\E[|X|/\\delta]$ by the properties of expectation), which is what we will show. But notice that\n$$\n\\mathbb{1}(Y \\geq 1) \\leq Y.\n$$\nWhy does this hold? We can investigate the two possible values of the indicator function to see. If $Y$ is less than 1, then the indicator function is 0, and since $Y$ is nonnegative, it must be at least as big as 0, so the inequality holds. If $Y \\geq 1$, then the indicator function takes the value 1, and we just said that $Y \\geq 1$, so the inequality holds. If we take the expectation of both sides of this inequality, we obtain the result (remember, the expectation of an indicator function is the probability of the event).\n\n\n:::\n\nIn words, Markov's inequality says that the probability of a random variable being large in magnitude cannot be high if the average is not large in magnitude. Blitzstein and Hwang (2019) provide an excellent intuition behind this result. Let $X$ be the income of a randomly selected individual in a population and set $\\delta = 2\\E[X]$ so that the inequality becomes $\\P(X \\geq 2\\E[X]) \\leq 1/2$ (assuming that all income is nonnegative). Here, the inequality says that the share of the population with an income at least twice the average can be at most 0.5, since if more than half the population were making twice the average income, then the average would have to be higher. \n\nIt's pretty astounding how general this result is since it holds for all random variables. Of course, its generality comes at the expense of not being very informative. 
If $\\E[|X|] = 5$, for instance, the inequality tells us that $\\P(|X| \\geq 1) \\leq 5$, which is not very helpful since we already know that probabilities are less than 1! We can get tighter bounds if we are willing to make some assumptions about $X$. \n\n::: {#thm-chebyshev}\n## Chebyshev Inequality\nSuppose that $X$ is an r.v. for which $\\V[X] < \\infty$. Then, for every real number $\\delta > 0$,\n$$\n\\P(|X-\\E[X]| \\geq \\delta) \\leq \\frac{\\V[X]}{\\delta^2}.\n$$\n:::\n\n\n::: {.proof}\nTo prove this, we only need to square both sides of the inequality inside the probability statement and apply Markov's inequality:\n$$\n\\P\\left( |X - \\E[X]| \\geq \\delta \\right) = \\P((X-\\E[X])^2 \\geq \\delta^2) \\leq \\frac{\\E[(X - \\E[X])^2]}{\\delta^2} = \\frac{\\V[X]}{\\delta^2},\n$$\nwith the last equality holding by the definition of variance. \n:::\n\nChebyshev's inequality is a straightforward extension of the Markov result: the probability of a random variable being far from its mean (that is, $|X-\\E[X]|$ being large) is limited by the variance of the random variable. If we let $\\delta = c\\sigma$, where $\\sigma$ is the standard deviation of $X$, then we can use this result to bound the normalized deviation from the mean:\n$$\n\\P\\left(\\frac{|X - \\E[X]|}{\\sigma} > c \\right) \\leq \\frac{1}{c^2}.\n$$\nWith $c = 2$, this statement says that the probability of being more than two standard deviations away from the mean must be no more than $1/4 = 0.25$. Notice that this bound can be fairly wide. If $X$ has a normal distribution, we know that about 5% of draws will be greater than 2 SDs away from the mean, much lower than the 25% bound implied by Chebyshev's inequality. \n\n## The law of large numbers\n\nWe can now use these inequalities to show how estimators can be consistent for their target quantities of interest. Why are these inequalities helpful for this purpose? Remember that convergence in probability was about the probability of an estimator being far away from a value going to zero. 
Chebyshev's inequality shows that we can bound these exact probabilities. \n\nThe most famous consistency result has a special name.\n\n::: {#thm-lln}\n\n## Weak Law of Large Numbers\nLet $X_1, \\ldots, X_n$ be i.i.d. draws from a distribution with mean $\\mu = \\E[X_i]$ and variance $\\sigma^2 = \\V[X_i] < \\infty$. Let $\\Xbar_n = \\frac{1}{n} \\sum_{i =1}^n X_i$. Then, $\\Xbar_n \\inprob \\mu$.\n\n:::\n\n::: {.proof}\nRecall that the sample mean is unbiased, so $\\E[\\Xbar_n] = \\mu$ with sampling variance $\\sigma^2/n$. We can then apply Chebyshev's inequality to the sample mean to get\n$$\n\\P(|\\Xbar_n - \\mu| \\geq \\delta) \\leq \\frac{\\sigma^2}{n\\delta^2}.\n$$\nAs $n\\rightarrow\\infty$, the right-hand side goes to 0, which means that the left-hand side also must go to 0, which is the definition of $\\Xbar_n$ converging in probability to $\\mu$. \n:::\n\nThe weak law of large numbers (WLLN) shows that, under general conditions, the sample mean gets closer to the population mean as $n\\rightarrow\\infty$. A version of this result even holds when the variance of the data is infinite (so long as the mean is finite), though that's a situation that most analysts will rarely face. \n\n::: {.callout-note}\n\nThe naming of the \"weak\" law of large numbers seems to imply the existence of a \"strong\" law of large numbers (SLLN), and this is true. The SLLN states that the sample mean converges to the population mean with probability 1. This type of convergence, called **almost sure convergence**, is stronger than convergence in probability, which only says that the probability of the sample mean being close to the population mean converges to 1. While it is nice to know that this stronger form of convergence holds for the sample mean under the same assumptions, it is rare for folks outside of theoretical probability and statistics to rely on almost sure convergence. 
\n\n:::\n\n\n\n::: {#exm-lln}\n\nIt can be helpful to see how the distribution of the sample mean changes as a function of the sample size to appreciate the WLLN. We can show this by taking repeated iid samples of different sizes from an exponential random variable with rate parameter 0.5 so that $\\E[X_i] = 2$. In @fig-lln-sim, we show the distribution of the sample mean (across repeated samples) when the sample size is 15 (black), 30 (violet), 100 (blue), and 1000 (green). We can see how the sample mean distribution is \"collapsing\" on the true population mean, 2. The probability of being far away from 2 becomes progressively smaller. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sampling distribution of the sample mean as a function of sample size.](03_asymptotics_files/figure-pdf/fig-lln-sim-1.pdf){#fig-lln-sim}\n:::\n:::\n\n\n\n\n:::\n\n\nThe WLLN also holds for random vectors in addition to random variables. Let $(\\X_1, \\ldots, \\X_n)$ be an iid sample of random vectors of length $k$, $\\mb{X}_i = (X_{i1}, \\ldots, X_{ik})$. We can define the vector sample mean as just the vector of sample means for each of the entries:\n\n$$\n\\overline{\\mb{X}}_n = \\frac{1}{n} \\sum_{i=1}^n \\mb{X}_i =\n\\begin{pmatrix}\n\\Xbar_{n,1} \\\\ \\Xbar_{n,2} \\\\ \\vdots \\\\ \\Xbar_{n, k}\n\\end{pmatrix}\n$$\nSince this is just a vector of sample means, each random variable in the random vector will converge in probability to the mean of that random variable. Fortunately, this is the exact definition of convergence in probability for random vectors. 
We formally write this in the following theorem.\n\n::: {#thm-vector-wlln}\n\nIf $\\X_i \\in \\mathbb{R}^k$ are iid draws from a distribution with $\\E[X_{ij}] < \\infty$ for all $j=1,\\ldots,k$, then as $n\\rightarrow\\infty$, \n\n$$\n\\overline{\\mb{X}}_n \\inprob \\E[\\X] =\n\\begin{pmatrix}\n\\E[X_{i1}] \\\\ \\E[X_{i2}] \\\\ \\vdots \\\\ \\E[X_{ik}]\n\\end{pmatrix}.\n$$\n:::\n\n\n::: {.callout-note}\n\n## Notation alert\n\nYou will have noticed that many of the formal results we have presented so far have \"moment conditions\" requiring that certain moments be finite. For the vector WLLN, this applied to the mean of each variable in the vector. Some books use a shorthand for this: $\\E\\Vert \\X_i\\Vert < \\infty$, where\n$$\n\\Vert\\X_i\\Vert = \\left(X_{i1}^2 + X_{i2}^2 + \\ldots + X_{ik}^2\\right)^{1/2}. \n$$\nThis expression has slightly more compact notation, but why does it work? One can show that this function, called the **Euclidean norm** or $L_2$-norm, is a **convex** function, so we can apply Jensen's inequality to show that:\n$$\n\\E\\Vert \\X_i\\Vert \\geq \\Vert \\E[\\X_i] \\Vert = (\\E[X_{i1}]^2 + \\ldots + \\E[X_{ik}]^2)^{1/2}.\n$$\nSo if $\\E\\Vert \\X_i\\Vert$ is finite, all the component means are finite. Otherwise, the right-hand side of the previous equation would be infinite. \n:::\n\n\n\n## Consistency of estimators\n\n\nThe WLLN shows that the sample mean of iid draws is consistent for the population mean, which is a massive result given that so many estimators are sample means of potentially complicated functions of the data. What about other estimators? The proof of the WLLN points to one way to determine if an estimator is consistent: show that it is unbiased and that its sampling variance shrinks to zero as the sample size grows. 
\n\n\n::: {#thm-consis}\n\nFor any estimator $\\widehat{\\theta}_n$, if $\\text{bias}[\\widehat{\\theta}_n] = 0$ and $\\V[\\widehat{\\theta}_n] \\rightarrow 0$ as $n\\rightarrow \\infty$, then $\\widehat{\\theta}_n$ is consistent.\n\n:::\n\nThus, for an unbiased estimator, if we can characterize its sampling variance, we should be able to tell if it is consistent. This result is handy since working with the probability statements used for the WLLN can sometimes be quite confusing. \n\nWhat about biased estimators? Consider a plug-in estimator like $\\widehat{\\alpha} = \\log(\\Xbar_n)$ where $X_1, \\ldots, X_n$ are iid from a population with mean $\\mu$. We know that for nonlinear functions like logarithms we have $\\log\\left(\\E[Z]\\right) \\neq \\E[\\log(Z)]$, so $\\E[\\widehat{\\alpha}] \\neq \\log(\\E[\\Xbar_n])$ and the plug-in estimator will be biased for $\\log(\\mu)$. It will also be difficult to obtain an expression for the bias in terms of $n$. Is all hope lost here? Must we give up on consistency? No, and in fact, consistency will be much simpler to show in this setting. \n\n\n::: {#thm-inprob-properties}\n\n## Properties of convergence in probability\n\nLet $X_n$ and $Z_n$ be two sequences of random variables such that $X_n \\inprob a$ and $Z_n \\inprob b$, and let $g(\\cdot)$ be a continuous function. Then, \n\n1. $g(X_n) \\inprob g(a)$ (continuous mapping theorem)\n2. $X_n + Z_n \\inprob a + b$\n3. $X_nZ_n \\inprob ab$\n4. $X_n/Z_n \\inprob a/b$ if $b \\neq 0$.\n\n:::\n\nWe can now see that many of the nasty problems with expectations and nonlinear functions are made considerably easier with convergence in probability in the asymptotic setting. So while we know that $\\log(\\Xbar_n)$ is biased for $\\log(\\mu)$, we know that it is consistent since $\\log(\\Xbar_n) \\inprob \\log(\\mu)$ because $\\log$ is a continuous function. 
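As a quick illustration of the continuous mapping theorem at work (our sketch, reusing the exponential example from earlier in the chapter where $\E[X_i] = 2$), we can watch $\log(\Xbar_n)$ approach $\log(2)$ as $n$ grows:

```r
## Sketch: log(Xbar_n) is biased for log(mu) but consistent.
## X_i ~ Exponential(rate = 0.5), so mu = E[X_i] = 2 and the target
## is log(2) by the continuous mapping theorem.
set.seed(2138)
sizes <- c(100, 10000, 1000000)
est <- sapply(sizes, function(n) log(mean(rexp(n, rate = 0.5))))
abs(est - log(2))  # errors shrink roughly like 1/sqrt(n)
```

No amount of simulation proves convergence, of course, but the shrinking errors match what the theorem guarantees.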
\n\n::: {#exm-nonresponse}\n\nSuppose we implemented a survey by randomly selecting a sample of size $n$ from the population, but not everyone responded to our survey. Let the data consist of pairs of random variables, $(Y_1, R_1), \\ldots, (Y_n, R_n)$, where $Y_i$ is the question of interest and $R_i$ is a binary indicator for whether the respondent answered the question ($R_i = 1$) or not ($R_i = 0$). Our goal is to estimate the mean of the question for responders: $\\E[Y_i \\mid R_i = 1]$. We can use the law of iterated expectations to obtain\n$$\n\\begin{aligned}\n\\E[Y_iR_i] &= \\E[Y_i \\mid R_i = 1]\\P(R_i = 1) + \\E[ 0 \\mid R_i = 0]\\P(R_i = 0) \\\\\n\\implies \\E[Y_i \\mid R_i = 1] &= \\frac{\\E[Y_iR_i]}{\\P(R_i = 1)}\n\\end{aligned}\n$$\n\nThe relevant estimator for this quantity is the mean of the outcome among those who responded, which is slightly more complicated than a typical sample mean because the denominator is a random variable:\n$$\n\\widehat{\\theta}_n = \\frac{\\sum_{i=1}^n Y_iR_i}{\\sum_{i=1}^n R_i}. \n$$\nNotice that this estimator is the ratio of two random variables. The numerator has mean $n\\E[Y_iR_i]$ and the denominator has mean $n\\P(R_i = 1)$. It is then tempting to say that we can take the ratio of these means as the mean of $\\widehat{\\theta}_n$, but expectations do not pass through a nonlinear function like this ratio. \n\nWe can establish consistency of our estimator, though, by noting that we can rewrite the estimator as a ratio of sample means\n$$\n\\widehat{\\theta}_n = \\frac{(1/n)\\sum_{i=1}^n Y_iR_i}{(1/n)\\sum_{i=1}^n R_i},\n$$\nwhere by the WLLN the numerator $(1/n)\\sum_{i=1}^n Y_iR_i \\inprob \\E[Y_iR_i]$ and the denominator $(1/n)\\sum_{i=1}^n R_i \\inprob \\P(R_i = 1)$. 
Thus, by @thm-inprob-properties, we have \n$$\n\\widehat{\\theta}_n = \\frac{(1/n)\\sum_{i=1}^n Y_iR_i}{(1/n)\\sum_{i=1}^n R_i} \\inprob \\frac{\\E[Y_iR_i]}{\\P(R_i = 1)} = \\E[Y_i \\mid R_i = 1],\n$$\nso long as the probability of responding is greater than zero. This establishes that our sample mean among responders, while biased for the conditional expectation among responders, is consistent for that quantity. \n:::\n\n\nKeeping the difference between unbiased and consistent clear in your mind is essential. You can easily create ridiculous unbiased estimators that are inconsistent. Let's return to our iid sample, $X_1, \\ldots, X_n$, from a population with $\\E[X_i] = \\mu$. There is nothing in the rule book against defining an estimator $\\widehat{\\theta}_{first} = X_1$ that uses the first observation as the estimate. This estimator is silly, but it is unbiased since $\\E[\\widehat{\\theta}_{first}] = \\E[X_1] = \\mu$. It is inconsistent since the sampling variance of this estimator is just the variance of the population distribution, $\\V[\\widehat{\\theta}_{first}] = \\V[X_i] = \\sigma^2$, which does not change as a function of the sample size. Generally speaking, we can regard \"unbiased but inconsistent\" estimators as silly and not worth our time (along with biased and inconsistent estimators). \n\nOther estimators are biased but consistent, and these are often much more interesting. We already saw one such estimator in @exm-nonresponse, but there are many more. Maximum likelihood estimators, for example, are (under some regularity conditions) consistent for the parameters of a parametric model but are often biased. \n\n::: {#exm-plug-in-variance}\n\n## Plug-in variance estimator\n\nIn the last chapter, we introduced the plug-in estimator for the population variance, \n$$\n\\widehat{\\sigma}^2 = \\frac{1}{n} \\sum_{i=1}^n (X_i - \\Xbar_n)^2,\n$$\nwhich we will now show is biased but consistent. 
To see the bias, note that we can rewrite the sum of squared deviations $$\\sum_{i=1}^n (X_i - \\Xbar_n)^2 = \\sum_{i=1}^n X_i^2 - n\\Xbar_n^2. $$\nThen, the expectation of the plug-in estimator is\n$$\n\\begin{aligned}\n\\E[\\widehat{\\sigma}^2] & = \\E\\left[\\frac{1}{n}\\sum_{i=1}^n X_i^2\\right] - \\E[\\Xbar_n^2] \\\\\n&= \\E[X_i^2] - \\frac{1}{n^2}\\sum_{i=1}^n \\sum_{j=1}^n \\E[X_iX_j] \\\\\n&= \\E[X_i^2] - \\frac{1}{n^2}\\sum_{i=1}^n \\E[X_i^2] - \\frac{1}{n^2}\\sum_{i=1}^n \\sum_{j\\neq i} \\underbrace{\\E[X_i]\\E[X_j]}_{\\text{independence}} \\\\\n&= \\E[X_i^2] - \\frac{1}{n}\\E[X_i^2] - \\frac{1}{n^2} n(n-1)\\mu^2 \\\\\n&= \\frac{n-1}{n} \\left(\\E[X_i^2] - \\mu^2\\right) \\\\\n&= \\frac{n-1}{n} \\sigma^2 = \\sigma^2 - \\frac{1}{n}\\sigma^2\n\\end{aligned}\n$$\nThus, we can see that the bias of the plug-in estimator is $-(1/n)\\sigma^2$, so it slightly underestimates the variance. Nicely, though, the bias shrinks to zero as a function of the sample size, so, extending the logic of @thm-consis to estimators with vanishing bias, it will be consistent so long as the sampling variance of $\\widehat{\\sigma}^2$ also shrinks as a function of the sample size, which it does (though we omit that proof here). Of course, simply multiplying this estimator by $n/(n-1)$ will give an unbiased and consistent estimator that is also the typical sample variance estimator. \n\n:::\n\n## Convergence in distribution and the central limit theorem\n\nConvergence in probability and the law of large numbers are beneficial for understanding how our estimators will (or will not) collapse to their estimand as the sample size increases. But what about the shape of the sampling distribution of our estimators? For statistical inference, we would like to be able to make probability statements such as $\\P(a \\leq \\widehat{\\theta}_n \\leq b)$. These statements will be the basis of hypothesis testing and confidence intervals. 
But to make those types of statements, we need to know the entire distribution of $\\widehat{\\theta}_n$, not just the mean and variance. Luckily, established results will allow us to approximate the sampling distribution of a vast swath of estimators when our sample sizes are large. \n\nWe first need to describe a weaker form of convergence before we can develop these approximations.\n\n\n::: {#def-indist}\nLet $X_1,X_2,\\ldots$, be a sequence of r.v.s, and for $n = 1,2, \\ldots$ let $F_n(x)$ be the c.d.f. of $X_n$. Then it is said that $X_1, X_2, \\ldots$ **converges in distribution** to r.v. $X$ with c.d.f. $F(x)$ if\n$$\n\\lim_{n\\rightarrow \\infty} F_n(x) = F(x),\n$$\nfor all values of $x$ for which $F(x)$ is continuous. We write this as $X_n \\indist X$ or sometimes $X_n ⇝ X$. \n:::\n\nEssentially, convergence in distribution means that as $n$ gets large, the distribution of $X_n$ becomes more and more similar to the distribution of $X$, which we often call the **asymptotic distribution** of $X_n$ (other names include the **large-sample distribution**). If we know that $X_n \\indist X$, then we can use the distribution of $X$ as an approximation to the distribution of $X_n$, and that approximation can be reasonably accurate. \n\nOne of the most remarkable results in probability and statistics is that a large class of estimators will converge in distribution to one particular family of distributions: the normal. This result is one reason we study the normal so much and why investing in building intuition about it will pay off across many domains of applied work. We call this broad class of results the \"central limit theorem\" (CLT), but it would probably be more accurate to refer to them as \"central limit theorems\" since much of statistics is devoted to showing the result in different settings. We now present the simplest CLT for the sample mean. \n\n\n::: {#thm-clt}\n\n## Central Limit Theorem \nLet $X_1, \\ldots, X_n$ be i.i.d. 
r.v.s from a distribution with mean $\\mu = \\E[X_i]$ and variance $\\sigma^2 = \\V[X_i]$. Then if $\\E[X_i^2] < \\infty$, we have\n$$\n\\frac{\\Xbar_n - \\mu}{\\sqrt{\\V[\\Xbar_n]}} = \\frac{\\sqrt{n}\\left(\\Xbar_n - \\mu\\right)}{\\sigma} \\indist \\N(0, 1).\n$$\n:::\n\nIn words: the sample mean of a random sample from a population with finite mean and variance will be approximately normally distributed in large samples. Notice how we have not made any assumptions about the distribution of the underlying random variables, $X_i$. They could be binary, event counts, continuous, or anything else. The CLT is incredibly broadly applicable. \n\n::: {.callout-note}\n\n## Notation alert\n\nWhy do we state the CLT in terms of the sample mean after centering and scaling by its standard error? Suppose we don't normalize the sample mean in this way. In that case, it isn't easy to talk about convergence in distribution because we know from the WLLN that $\\Xbar_n \\inprob \\mu$, so in the limit, the distribution of $\\Xbar_n$ is concentrated in a point mass at that value. Normalizing by centering and rescaling ensures that the variance of the resulting quantity will not depend on $n$, so it makes sense to talk about its distribution converging. Sometimes you will see the equivalent result as\n$$\n\\sqrt{n}\\left(\\Xbar_n - \\mu\\right) \\indist \\N(0, \\sigma^2).\n$$\n:::\n\nWe can use this result to state approximations that we can use when discussing estimators such as\n$$\n\\Xbar_n \\overset{a}{\\sim} N(\\mu, \\sigma^2/n),\n$$\nwhere we use $\\overset{a}{\\sim}$ to mean \"approximately distributed as in large samples.\" This approximation allows us to say things like: \"in large samples, we should expect the sample mean to be within $2\\sigma/\\sqrt{n}$ of the true mean in 95% of repeated samples.\" These statements will be essential for hypothesis tests and confidence intervals! 
Estimators so often follow the CLT that we have an expression for this property.\n\n::: {#def-asymptotically-normal}\n\nAn estimator $\\widehat{\\theta}_n$ is **asymptotically normal** if for some $\\theta$\n$$\n\\sqrt{n}\\left( \\widehat{\\theta}_n - \\theta \\right) \\indist N\\left(0,\\V_{\\theta}\\right).\n$$\n\n:::\n\n::: {#exm-bin-clt}\n\nTo illustrate how the CLT works, we can simulate the sampling distribution of the (normalized) sample mean at different sample sizes. Let $X_1, \\ldots, X_n$ be iid samples from a Bernoulli with probability of success 0.25. We then draw repeated samples of size $n=30$ and $n=100$ and calculate $\\sqrt{n}(\\Xbar_n - 0.25)/\\sigma$ for each random sample. @fig-clt plots the density of these two sampling distributions along with a standard normal reference. We can see that even at $n=30$, the rough shape of the density looks normal, with spikes and valleys due to the discrete nature of the data (the sample mean can only take on 31 possible values in this case). By $n=100$, the sampling distribution is very close to the true standard normal. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sampling distributions of the normalized sample mean at n=30 and n=100.](03_asymptotics_files/figure-pdf/fig-clt-1.pdf){#fig-clt}\n:::\n:::\n\n\n\n\n:::\n\n\nThere are several properties of convergence in distribution that are helpful to us. \n\n::: {#thm-indist-properties}\n\n## Properties of convergence in distribution\n\nLet $X_n$ be a sequence of random variables $X_1, X_2,\\ldots$ that converges in distribution to some rv $X$ and let $Y_n$ be a sequence of random variables $Y_1, Y_2,\\ldots$ that converges in probability to some number, $c$. Then, \n\n1. $g(X_n) \\indist g(X)$ for all continuous functions $g$.\n2. $X_nY_n$ converges in distribution to $cX$\n3. $X_n + Y_n$ converges in distribution to $X + c$\n4. 
$X_n / Y_n$ converges in distribution to $X / c$ if $c \\neq 0$\n\n:::\n\nWe refer to the last three results as **Slutsky's theorem**. These results are often crucial for determining an estimator's asymptotic distribution. \n\nA critical application of Slutsky's theorem is when we replace the (unknown) population variance in the CLT with an estimate. Recall the definition of the **sample variance** as\n$$\nS_n^2 = \\frac{1}{n-1} \\sum_{i=1}^n (X_i - \\Xbar_n)^2,\n$$\nwith the **sample standard deviation** defined as $S_{n} = \\sqrt{S_{n}^2}$. It's easy to show that these are consistent estimators for their respective population parameters\n$$ \nS_{n}^2 \\inprob \\sigma^2 = \\V[X_i], \\qquad S_{n} \\inprob \\sigma,\n$$\nwhich, by Slutsky's theorem, implies that\n$$\n\\frac{\\sqrt{n}\\left(\\Xbar_n - \\mu\\right)}{S_n} \\indist \\N(0, 1)\n$$\nComparing this result to the statement of CLT, we see that replacing the population variance with a consistent estimate of the variance (or standard deviation) does not affect the asymptotic distribution. \n\n\nLike with the WLLN, the CLT holds for random vectors of sample means, where their centered and scaled versions converge to a multivariate normal distribution with a covariance matrix equal to the covariance matrix of the underlying random vectors of data, $\\X_i$. \n\n::: {#thm-multivariate-clt}\n\nIf $\\mb{X}_i \\in \\mathbb{R}^k$ are i.i.d. 
and $\\E\\Vert \\mb{X}_i \\Vert^2 < \\infty$, then as $n \\to \\infty$,\n$$\n\\sqrt{n}\\left( \\overline{\\mb{X}}_n - \\mb{\\mu}\\right) \\indist \\N(0, \\mb{\\Sigma}),\n$$\nwhere $\\mb{\\mu} = \\E[\\mb{X}_i]$ and $\\mb{\\Sigma} = \\V[\\mb{X}_i] = \\E\\left[(\\mb{X}_i-\\mb{\\mu})(\\mb{X}_i - \\mb{\\mu})'\\right]$.\n\n:::\n\nNotice that $\\mb{\\mu}$ is the vector of population means for all the random variables in $\\X_i$ and $\\mb{\\Sigma}$ is the variance-covariance matrix for that vector.\n\n\n::: {.callout-note}\n\nAs in the notation alert for the WLLN, we are using shorthand here, $\\E\\Vert \\mb{X}_i \\Vert^2 < \\infty$, which implies that $\\E[X_{ij}^2] < \\infty$ for all $j = 1,\\ldots, k$, or equivalently, that each variable in the sample mean has finite variance. \n\n:::\n\n## Confidence intervals\n\n\nWe now turn to an essential application of the central limit theorem: confidence intervals. \n\nYou have run your experiment and presented your readers with your single best guess about the treatment effect using the difference in sample means. You may have also presented the estimated standard error of this estimate to give readers a sense of how variable the estimate is. But none of these approaches answers a fairly compelling question: what range of values of the treatment effect is **plausible** given the data we observe?\n\nA point estimate typically has 0 probability of being the exact true value, but intuitively we hope that the true value is close to this estimate. **Confidence intervals** make this kind of intuition more formal by instead estimating ranges of values constructed so that a fixed percentage of these ranges contain the actual parameter value. \n\nWe begin with the basic definition of a confidence interval. 
\n\n::: {#def-coverage}\nA $1-\\alpha$ **confidence interval** for a real-valued parameter $\\theta$ is a pair of statistics $L= L(X_1, \\ldots, X_n)$ and $U = U(X_1, \\ldots, X_n)$ such that $L < U$ for all values of the sample and such that \n$$ \n\\P(L \\leq \\theta \\leq U \\mid \\theta) \\geq 1-\\alpha, \\quad \\forall \\theta \\in \\Theta.\n$$\n:::\n\nWe say that a $1-\\alpha$ confidence interval covers (contains, captures, traps, etc.) the true value at least $100(1-\\alpha)\\%$ of the time, and we refer to $1-\\alpha$ as the **coverage probability** or simply **coverage**. Typical confidence levels include 95% ($\\alpha = 0.05$) and 90% ($\\alpha = 0.1$). \n\nSo a confidence interval is a random interval with a particular guarantee about how often it will contain the true value. It's important to remember what is random and what is fixed in this setup. The interval varies from sample to sample, but the true value of the parameter stays fixed, and the coverage is how often we should expect the interval to contain that true value. The \"repeating my sample over and over again\" analogy can break down very quickly, so it's sometimes helpful to interpret coverage as a guarantee across confidence intervals from different experiments. In particular, suppose that a journal publishes 100 quantitative articles annually, each producing a single 95% confidence interval for their quantity of interest. Then, if the confidence intervals are valid, we should expect about 95 of those confidence intervals to contain the true value. \n\n\n::: {.callout-warning}\n\nSuppose you calculate a 95% confidence interval, $[0.1, 0.4]$. It's tempting to make a probability statement like $\\P(0.1 \\leq \\theta \\leq 0.4 \\mid \\theta) = 0.95$ or that there's a 95% chance that the parameter is in $[0.1, 0.4]$. 
But looking at the probability statement, everything on the left-hand side of the conditioning bar is fixed, so the probability either has to be 0 ($\\theta$ is outside the interval) or 1 ($\\theta$ is in the interval). The coverage probability of a confidence interval refers to its status as a pair of random variables, $(L, U)$, not any particular realization of those variables like $(0.1, 0.4)$. As an analogy, consider if we calculated the sample mean as $0.25$ and then tried to say that $0.25$ is unbiased for the population mean. This statement doesn't make sense because unbiasedness refers to how the sample mean varies from sample to sample. \n\n:::\n\n\nIn most cases, we will not be able to derive exact confidence intervals but rather confidence intervals that are **asymptotically valid**, which means that if we write the interval as a function of the sample size, $(L_n, U_n)$, it would have **asymptotic coverage**\n$$\n\\lim_{n\\to\\infty} \\P(L_n \\leq \\theta \\leq U_n) \\geq 1-\\alpha \\quad\\forall\\theta\\in\\Theta.\n$$\n\nAsymptotic coverage is the property we can show for most confidence intervals since we usually rely on large sample approximations based on the central limit theorem. \n\n\n\n### Deriving confidence intervals\n\nIf you have taken any statistics before, you probably have seen the standard formula for the 95% confidence interval of the sample mean, \n$$ \n\\left[\\Xbar_n - 1.96\\frac{s}{\\sqrt{n}},\\; \\Xbar_n + 1.96\\frac{s}{\\sqrt{n}}\\right],\n$$\nwhere you can recall that $s$ is the sample standard deviation and $s/\\sqrt{n}$ is the estimate of the standard error of the sample mean. If this is a 95% confidence interval, then the probability that it contains the population mean $\\mu$ should be 0.95, but how can we derive this? We can justify this logic using the central limit theorem, and the argument will hold for any asymptotically normal estimator. 
\n\n\nLet's say that we have an estimator, $\\widehat{\\theta}_n$ for the parameter $\\theta$ with estimated standard error $\\widehat{\\se}[\\widehat{\\theta}_n]$. If the estimator is asymptotically normal, then in large samples, we know that\n$$ \n\\frac{\\widehat{\\theta}_n - \\theta}{\\widehat{\\se}[\\widehat{\\theta}_n]} \\overset{a}{\\sim} \\N(0, 1).\n$$\nWe can then use our knowledge of the standard normal and the empirical rule to find\n$$ \n\\P\\left( -1.96 \\leq \\frac{\\widehat{\\theta}_n - \\theta}{\\widehat{\\se}[\\widehat{\\theta}_n]} \\leq 1.96\\right) = 0.95\n$$\nand by multiplying each part of the inequality by $\\widehat{\\se}[\\widehat{\\theta}_n]$, we get\n$$ \n\\P\\left( -1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n] \\leq \\widehat{\\theta}_n - \\theta \\leq 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n]\\right) = 0.95.\n$$\nWe then subtract $\\widehat{\\theta}_n$ from all parts to get\n$$ \n\\P\\left(-\\widehat{\\theta}_n - 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n] \\leq - \\theta \\leq -\\widehat{\\theta}_n + 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n]\\right) = 0.95,\n$$\nand finally we multiply all parts by $-1$ (flipping the inequalities) to arrive at\n$$ \n\\P\\left(\\widehat{\\theta}_n - 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n] \\leq \\theta \\leq \\widehat{\\theta}_n + 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n]\\right) = 0.95.\n$$\nTo connect back to the definition of the confidence interval, we have now shown that the random interval $[L, U]$ where\n$$ \n\\begin{aligned}\n L = L(X_1, \\ldots, X_n) &= \\widehat{\\theta}_n - 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n] \\\\\n U = U(X_1, \\ldots, X_n) &= \\widehat{\\theta}_n + 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n],\n\\end{aligned}\n$$\nis an asymptotically valid confidence interval.[^1] Substituting $\\Xbar_n$ for $\\widehat{\\theta}_n$ and $s/\\sqrt{n}$ for $\\widehat{\\se}[\\widehat{\\theta}_n]$ establishes that the standard 95% confidence interval for the sample mean above is asymptotically valid. 
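The derivation above can be sketched in code (our illustration with simulated data, not from the chapter): we build the interval $\widehat{\theta}_n \pm 1.96\,\widehat{\se}[\widehat{\theta}_n]$ for a sample mean and check by simulation that it traps the true mean in roughly 95% of repeated samples.

```r
## Sketch: a normal-approximation 95% CI for the sample mean and a
## simulation check of its coverage. The true mean and sd here are
## assumptions chosen for illustration.
set.seed(60637)
mu <- 1

one_ci <- function(n) {
  x <- rnorm(n, mean = mu, sd = 3)
  se_hat <- sd(x) / sqrt(n)
  c(lower = mean(x) - 1.96 * se_hat, upper = mean(x) + 1.96 * se_hat)
}

one_ci(500)  # one realization of the random interval [L, U]

## Coverage: fraction of repeated samples whose interval contains mu
covers <- replicate(5000, {
  ci <- one_ci(500)
  ci["lower"] <= mu && mu <= ci["upper"]
})
mean(covers)  # close to 0.95
```

The coverage statement is about the procedure, so the check averages over many intervals rather than inspecting any single one.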
\n\n[^1]: Implicit in this analysis is that the standard error estimate is consistent. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Critical values for the standard normal.](03_asymptotics_files/figure-pdf/fig-std-normal-1.pdf){#fig-std-normal}\n:::\n:::\n\n\n\n\nHow can we generalize this to $1-\\alpha$ confidence intervals? For a standard normal r.v., $Z$, we know that\n$$ \n\\P(-z_{\\alpha/2} \\leq Z \\leq z_{\\alpha/2}) = 1-\\alpha,\n$$\nwhich implies that we can obtain a $1-\\alpha$ asymptotic confidence interval by using the interval $[L, U]$, where\n$$ \nL = \\widehat{\\theta}_{n} - z_{\\alpha/2} \\widehat{\\se}[\\widehat{\\theta}_{n}], \\quad U = \\widehat{\\theta}_{n} + z_{\\alpha/2} \\widehat{\\se}[\\widehat{\\theta}_{n}]. \n$$\nThis is sometimes shortened to $\\widehat{\\theta}_n \\pm z_{\\alpha/2} \\widehat{\\se}[\\widehat{\\theta}_{n}]$. Remember that we can obtain the values of $z_{\\alpha/2}$ easily from R:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## alpha = 0.1 for 90% CI\nqnorm(0.1 / 2, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.644854\n```\n:::\n:::\n\n\n\nAs a concrete example, then, we could derive a 90% asymptotic confidence interval for the sample mean as \n$$ \n\\left[\\Xbar_{n} - 1.64 \\frac{\\widehat{\\sigma}}{\\sqrt{n}}, \\Xbar_{n} + 1.64 \\frac{\\widehat{\\sigma}}{\\sqrt{n}}\\right].\n$$\n\n\n### Interpreting confidence intervals\n\nRemember that the interpretation of confidence is how the random interval performs over repeated samples. A valid 95% confidence interval is a random interval containing the true value in 95% of samples. Simulating repeated samples helps clarify this. \n\n\n::: {#exm-cis}\n\nSuppose we are taking samples of size $n=500$ of random variables where $X_i \\sim \\N(1, 10)$, and we want to estimate the population mean $\\E[X] = 1$. To do so, we repeat the following steps:\n\n1. Draw a sample of $n=500$ from $\\N(1, 10)$. \n2. 
Calculate the 95% confidence interval for the sample mean, $\\Xbar_n \\pm 1.96\\widehat{\\sigma}/\\sqrt{n}$. \n3. Plot the intervals along the x-axis and color them blue if they contain the truth (1) and red if not. \n\n@fig-ci-sim shows 100 iterations of these steps. We see that, as expected, most calculated CIs contain the true value. Five random samples produce intervals that fail to include 1, an exact coverage rate of 95%. Of course, this is just one simulation, and a different set of 100 random samples might have produced a slightly different coverage rate. The guarantee of the 95% confidence intervals is that if we were to continue to take these repeated samples, the long-run frequency of intervals covering the truth would approach 0.95. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![95% confidence intervals from 100 random samples. Intervals are blue if they contain the truth and red if they do not.](03_asymptotics_files/figure-pdf/fig-ci-sim-1.pdf){#fig-ci-sim}\n:::\n:::\n\n\n\n\n:::\n\n\n## Delta method {#sec-delta-method}\n\nSuppose that we know that an estimator follows the CLT, and so we have \n$$\n\\sqrt{n}\\left(\\widehat{\\theta}_n - \\theta \\right) \\indist \\N(0, V),\n$$\nbut we actually want to estimate $h(\\theta)$ so we use the plug-in estimator, $h(\\widehat{\\theta}_n)$. It seems like we should be able to apply part 1 of @thm-indist-properties, but the CLT established the large-sample distribution of the centered and scaled random sequence, $\\sqrt{n}(\\widehat{\\theta}_n - \\theta)$, not of the original estimator itself, which is what we would need to investigate the asymptotic distribution of $h(\\widehat{\\theta}_n)$. We can use a little bit of calculus to get an approximation of the distribution we need. 
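A quick simulation sketch can preview what this calculus will deliver (the setup here, $h(x) = x^2$ applied to the mean of exponential draws, is our own assumption for illustration): the spread of $h(\\Xbar_n)$ across repeated samples matches $|h'(\\mu)|$ times the standard error of $\\Xbar_n$.

```r
## Sketch (assumed setup): compare the simulated sd of h(Xbar) with the
## tangent-line prediction |h'(mu)| * sigma / sqrt(n), for h(x) = x^2
set.seed(60637)
n <- 1000
mu <- 2; sigma <- 2          # exponential with rate 0.5: mean 2, sd 2
h_draws <- replicate(5000, mean(rexp(n, rate = 0.5))^2)
sd_sim <- sd(h_draws)                      # simulated sd of h(Xbar)
sd_pred <- abs(2 * mu) * sigma / sqrt(n)   # h'(mu) = 2 * mu
c(simulated = sd_sim, predicted = sd_pred) # the two should be close
```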
\n\n\n::: {#thm-delta-method}\n\nIf $\\sqrt{n}\\left(\\widehat{\\theta}_n - \\theta\\right) \\indist \\N(0, V)$ and $h(u)$ is continuously differentiable in a neighborhood around $\\theta$, then as $n\\to\\infty$,\n$$\n\\sqrt{n}\\left(h(\\widehat{\\theta}_n) - h(\\theta) \\right) \\indist \\N(0, (h'(\\theta))^2 V).\n$$\n\n:::\n\nUnderstanding what's happening here is useful since it might help give intuition as to when this might go wrong. Why do we focus on continuously differentiable functions, $h()$? These functions can be well-approximated with a line in a neighborhood around a given point like $\\theta$. In @fig-delta, we show this where the tangent line at $\\theta_0$, which has slope $h'(\\theta_0)$, is very similar to $h(\\theta)$ for values close to $\\theta_0$. Because of this, we can approximate the difference between $h(\\widehat{\\theta}_n)$ and $h(\\theta_0)$ with what this tangent line would give us:\n$$\n\\underbrace{\\left(h(\\widehat{\\theta}_n) - h(\\theta_0)\\right)}_{\\text{change in } y} \\approx \\underbrace{h'(\\theta_0)}_{\\text{slope}} \\underbrace{\\left(\\widehat{\\theta}_n - \\theta_0\\right)}_{\\text{change in } x},\n$$\nand then multiplying both sides by $\\sqrt{n}$ gives\n$$\n\\sqrt{n}\\left(h(\\widehat{\\theta}_n) - h(\\theta_0)\\right) \\approx h'(\\theta_0)\\sqrt{n}\\left(\\widehat{\\theta}_n - \\theta_0\\right). \n$$\nThe right-hand side of this approximation converges to $h'(\\theta_0)Z$, where $Z$ is a random variable with distribution $\\N(0, V)$. The variance of this quantity will be\n$$\n\\V[h'(\\theta_0)Z] = (h'(\\theta_0))^2\\V[Z] = (h'(\\theta_0))^2V,\n$$\nby the properties of variances. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Linear approximation to nonlinear functions.](03_asymptotics_files/figure-pdf/fig-delta-1.pdf){#fig-delta}\n:::\n:::\n\n\n\n\n\n::: {#exm-log}\n\nLet's return to the iid sample $X_1, \\ldots, X_n$ with mean $\\mu = \\E[X_i]$ and variance $\\sigma^2 = \\V[X_i]$. 
From the CLT, we know that $\\sqrt{n}(\\Xbar_n - \\mu) \\indist \\N(0, \\sigma^2)$. Suppose that we want to estimate $\\log(\\mu)$, so we use the plug-in estimator $\\log(\\Xbar_n)$ (assuming that $X_i > 0$ for all $i$ so that we can take the log). What is the asymptotic distribution of this estimator? This is a situation where $\\widehat{\\theta}_n = \\Xbar_n$ and $h(\\mu) = \\log(\\mu)$. From basic calculus, we know that\n$$\nh'(\\mu) = \\frac{\\partial \\log(\\mu)}{\\partial \\mu} = \\frac{1}{\\mu},\n$$\nso applying the delta method, we can determine that \n$$\n\\sqrt{n}\\left(\\log(\\Xbar_n) - \\log(\\mu)\\right) \\indist \\N\\left(0,\\frac{\\sigma^2}{\\mu^2} \\right).\n$$\n\n:::\n\n::: {#exm-exp}\n\nWhat about estimating $\\exp(\\mu)$ with $\\exp(\\Xbar_n)$? Recall that\n$$\nh'(\\mu) = \\frac{\\partial \\exp(\\mu)}{\\partial \\mu} = \\exp(\\mu),\n$$\nso applying the delta method, we have\n$$\n\\sqrt{n}\\left(\\exp(\\Xbar_n) - \\exp(\\mu)\\right) \\indist \\N(0, \\exp(2\\mu)\\sigma^2),\n$$\nsince $\\exp(\\mu)^2 = \\exp(2\\mu)$. \n\n:::\n\n\nLike all of the results in this chapter, there is a multivariate version of the delta method that is incredibly useful in practical applications. We often will combine two different estimators (or two different estimated parameters) to estimate another quantity. We now let $\\mb{h}(\\mb{\\theta}) = (h_1(\\mb{\\theta}), \\ldots, h_m(\\mb{\\theta}))$ map from $\\mathbb{R}^k \\to \\mathbb{R}^m$ and be continuously differentiable (we make the function bold since it returns a vector). 
It will help us use more compact matrix notation if we introduce an $m \\times k$ Jacobian matrix of all partial derivatives\n$$\n\\mb{H}(\\mb{\\theta}) = \\mb{\\nabla}_{\\mb{\\theta}}\\mb{h}(\\mb{\\theta}) = \\begin{pmatrix}\n \\frac{\\partial h_1(\\mb{\\theta})}{\\partial \\theta_1} & \\frac{\\partial h_1(\\mb{\\theta})}{\\partial \\theta_2} & \\cdots & \\frac{\\partial h_1(\\mb{\\theta})}{\\partial \\theta_k} \\\\\n \\frac{\\partial h_2(\\mb{\\theta})}{\\partial \\theta_1} & \\frac{\\partial h_2(\\mb{\\theta})}{\\partial \\theta_2} & \\cdots & \\frac{\\partial h_2(\\mb{\\theta})}{\\partial \\theta_k} \\\\\n \\vdots & \\vdots & \\ddots & \\vdots \\\\\n \\frac{\\partial h_m(\\mb{\\theta})}{\\partial \\theta_1} & \\frac{\\partial h_m(\\mb{\\theta})}{\\partial \\theta_2} & \\cdots & \\frac{\\partial h_m(\\mb{\\theta})}{\\partial \\theta_k} \n\\end{pmatrix},\n$$\nwhich we can use to generate the equivalent multivariate linear approximation\n$$\n\\left(\\mb{h}(\\widehat{\\mb{\\theta}}_n) - \\mb{h}(\\mb{\\theta}_0)\\right) \\approx \\mb{H}(\\mb{\\theta}_0)\\left(\\widehat{\\mb{\\theta}}_n - \\mb{\\theta}_0\\right).\n$$\nWe can use this fact to derive the multivariate delta method.\n\n::: {#thm-multivariate-delta}\n\nSuppose that $\\sqrt{n}\\left(\\widehat{\\mb{\\theta}}_n - \\mb{\\theta}_0 \\right) \\indist \\N(0, \\mb{\\Sigma})$. Then for any function $\\mb{h}$ that is continuously differentiable in a neighborhood of $\\mb{\\theta}_0$, we have\n$$\n\\sqrt{n}\\left(\\mb{h}(\\widehat{\\mb{\\theta}}_n) - \\mb{h}(\\mb{\\theta}_0) \\right) \\indist \\N(0, \\mb{H}\\mb{\\Sigma}\\mb{H}'), \n$$\nwhere $\\mb{H} = \\mb{H}(\\mb{\\theta}_0)$.\n:::\n\n\nThis result follows from the approximation above plus rules about variances of random vectors. Remember that for any compatible matrix of constants, $\\mb{A}$, we have $\\V[\\mb{A}\\mb{Z}] = \\mb{A}\\V[\\mb{Z}]\\mb{A}'$. 
You can see that the matrix of constants appears twice here, like the matrix version of the \"squaring the constant\" rule for variance. \n\nThe delta method is handy for generating closed-form approximations for asymptotic standard errors, but the math is often quite complex for even simple estimators. It is usually more straightforward for applied researchers to use computational tools like the bootstrap to approximate the standard errors we need. The bootstrap has the trade-off of taking more computational time to implement than the delta method. Still, it is more easily adaptable across different estimators and domains with little human thinking time. \n", + "markdown": "# Asymptotics {#sec-asymptotics}\n\n\n## Introduction\n\nIn the last chapter, we defined estimators and started investigating their finite-sample properties like unbiasedness and the sampling variance. We call these \"finite-sample\" properties since they hold at any sample size. We saw that under iid data, the sample mean is unbiased for the population mean, but this result holds as much for $n = 10$ as it does for $n = 1,000,000$. But these properties are also of limited use: we only learn the center and spread of the sampling distribution of $\\Xbar_n$ from these results. What about the shape of the distribution? We can often derive the shape if we are willing to make certain assumptions on the underlying data (for example, if the data is normal, then the sample means will also be normal). Still, this approach is brittle: if our parametric assumption is false, we're back to square one. \n\nIn this chapter, we will take a different approach and see what happens to the sampling distribution of estimators as the sample size gets large, which we refer to as **asymptotic theory**. While asymptotics will often simplify our derivations, it is essential to understand that everything we do with asymptotics will be an approximation. 
No one ever has infinite data, but we hope that the approximations will be closer to the truth as our samples get larger. \n\n## Why convergence with probability is hard\n\nIt's helpful to review the basic idea of convergence in deterministic sequences from calculus: \n\n::: {#def-limit}\nA sequence $\\{a_n: n = 1, 2, \\ldots\\}$ has the **limit** $a$, written $a_n \\rightarrow a$ as $n\\rightarrow \\infty$ or $\\lim_{n\\rightarrow \\infty} a_n = a$, if for all $\\epsilon > 0$ there is some $n_{\\epsilon} < \\infty$ such that for all $n \\geq n_{\\epsilon}$, $|a_n - a| \\leq \\epsilon$. \n:::\n\nWe say that $a_n$ **converges** to $a$ if $\\lim_{n\\rightarrow\\infty} a_n = a$. Basically, a sequence converges to a number if the sequence gets closer and closer to that number as the sequence goes on. \n\nCan we apply this same idea to sequences of random variables (like estimators)? Let's look at a few examples that help clarify why this might be difficult.[^wass] Let's say that we have a sequence with $a_n = a$ for all $n$ (that is, a constant sequence). Then obviously $\\lim_{n\\rightarrow\\infty} a_n = a$. Now let's say we have a sequence of random variables, $X_1, X_2, \\ldots$, that are all independent with a standard normal distribution, $N(0,1)$. From the analogy to the deterministic case, it is tempting to say that $X_n$ converges to $X \\sim N(0, 1)$, but notice that because they are all different random variables, $\\P(X_n = X) = 0$. Thus, we must be careful about saying how one variable converges to another variable. \n\nAnother example highlights subtle problems with a sequence of random variables converging to a single value. Suppose we have a sequence of random variables $X_1, X_2, \\ldots$ where $X_n \\sim N(0, 1/n)$. Clearly, the distribution of $X_n$ will concentrate around 0 for large values of $n$, so it is tempting to say that $X_n$ converges to 0. But notice that $\\P(X_n = 0) = 0$ because of the nature of continuous random variables. 
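The concentration in the second example is easy to quantify: since $X_n \\sim N(0, 1/n)$, the probability of being farther than any fixed distance from 0 can be computed exactly with `pnorm` (the choice of 0.1 below is our own illustrative threshold):

```r
## P(|X_n| > 0.1) when X_n ~ N(0, 1/n): shrinks toward 0 as n grows
n <- c(10, 100, 1000)
p_far <- 2 * pnorm(-0.1, mean = 0, sd = 1 / sqrt(n))
round(p_far, 4)
```

So even though $\\P(X_n = 0) = 0$ at every $n$, the probability of being away from 0 still vanishes, which is the idea formalized next.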
\n\n\n\n\n[^wass]: Due to Wasserman (2004), Chapter 5.\n\n\n## Convergence in probability and consistency\n\nThere are several different ways that a sequence of random variables can converge. The first type of convergence deals with sequences converging to a single value.[^inprob]\n\n[^inprob]: Technically, a sequence can also converge in probability to another random variable, but the use case of converging to a single number is much more common in evaluating estimators. \n\n::: {#def-inprob}\nA sequence of random variables, $X_1, X_2, \\ldots$, is said to **converge in probability** to a value $b$ if for every $\\varepsilon > 0$,\n$$\n\\P(|X_n - b| > \\varepsilon) \\rightarrow 0,\n$$\nas $n\\rightarrow \\infty$. We write this $X_n \\inprob b$. \n:::\n\nWith deterministic sequences, we said that $a_n$ converges to $a$ if it gets closer and closer to $a$ as $n$ gets bigger. For convergence in probability, the sequence of random variables converges to $b$ if the probability that the random variables are far away from $b$ gets smaller and smaller as $n$ gets big. \n\n\n::: {.callout-note}\n\n## Notation alert\n\nYou will sometimes see convergence in probability written as $\\text{plim}(Z_n) = b$ if $Z_n \\inprob b$, where $\\text{plim}$ stands for \"probability limit.\"\n\n:::\n\n\nConvergence in probability is crucial for evaluating estimators. While we said that unbiasedness was not the be-all and end-all of properties of estimators, the following property is an essential and fundamental property that we would like all good estimators to have. \n\n::: {#def-consistency}\nAn estimator is **consistent** if $\\widehat{\\theta}_n \\inprob \\theta$. \n:::\n\nConsistency of an estimator implies that the sampling distribution of this estimator \"collapses\" on the true value as the sample size gets large. We say an estimator is inconsistent if it converges in probability to any other value, which is obviously a terrible property of an estimator. 
As the sample size gets large, the probability that an inconsistent estimator will be close to the truth will approach 0. \n\nWe can also define convergence in probability for a sequence of random vectors, $\\X_1, \\X_2, \\ldots$, where $\\X_i = (X_{i1}, \\ldots, X_{ik})$ is a random vector of length $k$. This sequence converges in probability to a vector $\\mb{b} = (b_1, \\ldots, b_k)$ if and only if each random variable in the vector converges to the corresponding element in $\\mb{b}$, or that $X_{nj} \\inprob b_j$ for all $j = 1, \\ldots, k$. \n\n\n## Useful inequalities\n\nAt first glance, establishing an estimator's consistency seems difficult. How can we know if a distribution will collapse to a specific value without knowing the shape or family of the distribution? It turns out that there are certain relationships between the mean and variance of a random variable and certain probability statements that hold for all distributions (that have finite variance, at least). These relationships will be crucial to establishing results that do not depend on a specific distribution. \n\n\n::: {#thm-markov}\n\n## Markov Inequality\n\nFor any r.v. $X$ and any $\\delta >0$, \n$$\n\\P(|X| \\geq \\delta) \\leq \\frac{\\E[|X|]}{\\delta}.\n$$\n:::\n\n::: {.proof}\n\nNotice that we can let $Y = |X|/\\delta$ and rewrite the statement as $\\P(Y \\geq 1) \\leq \\E[Y]$ (since $\\E[|X|]/\\delta = \\E[|X|/\\delta]$ by the properties of expectation), which is what we will show. But notice that\n$$\n\\mathbb{I}(Y \\geq 1) \\leq Y.\n$$\nWhy does this hold? We can investigate the two possible values of the indicator function to see. If $Y$ is less than 1, then the indicator function will be 0, but recall that $Y$ is nonnegative, so we know that it must be at least as big as 0 and the inequality holds. If $Y \\geq 1$, then the indicator function will take the value one, but we just said that $Y \\geq 1$, so the inequality holds. 
If we take the expectation of both sides of this inequality, we obtain the result (remember, the expectation of an indicator function is the probability of the event).\n\n\n:::\n\nIn words, Markov's inequality says that the probability of a random variable being large in magnitude cannot be high if the average is not large in magnitude. Blitzstein and Hwang (2019) provide an excellent intuition behind this result. Let $X$ be the income of a randomly selected individual in a population and set $\\delta = 2\\E[X]$ so that the inequality becomes $\\P(X \\geq 2\\E[X]) \\leq 1/2$ (assuming that all income is nonnegative). Here, the inequality says that the share of the population with an income at least twice the average must be no more than 0.5 since if more than half the population were making twice the average income, then the average would have to be higher. \n\nIt's pretty astounding how general this result is since it holds for all random variables. Of course, its generality comes at the expense of not being very informative. If $\\E[|X|] = 5$, for instance, the inequality tells us that $\\P(|X| \\geq 1) \\leq 5$, which is not very helpful since we already know that probabilities are less than 1! We can get tighter bounds if we are willing to make some assumptions about $X$. \n\n::: {#thm-chebyshev}\n## Chebyshev Inequality\nSuppose that $X$ is an r.v. for which $\\V[X] < \\infty$. Then, for every real number $\\delta > 0$,\n$$\n\\P(|X-\\E[X]| \\geq \\delta) \\leq \\frac{\\V[X]}{\\delta^2}.\n$$\n:::\n\n\n::: {.proof}\nTo prove this, we only need to square both sides of the inequality inside the probability statement and apply Markov's inequality:\n$$\n\\P\\left( |X - \\E[X]| \\geq \\delta \\right) = \\P((X-\\E[X])^2 \\geq \\delta^2) \\leq \\frac{\\E[(X - \\E[X])^2]}{\\delta^2} = \\frac{\\V[X]}{\\delta^2},\n$$\nwith the last equality holding by the definition of variance. 
\n:::\n\nChebyshev's inequality is a straightforward extension of the Markov result: the probability of a random variable being far from its mean (that is, $|X-\\E[X]|$ being large) is limited by the variance of the random variable. If we let $\\delta = c\\sigma$, where $\\sigma$ is the standard deviation of $X$, then we can use this result to bound the normalized deviation from the mean:\n$$\n\\P\\left(\\frac{|X - \\E[X]|}{\\sigma} > c \\right) \\leq \\frac{1}{c^2}.\n$$\nWith $c = 2$, this statement says the probability of being more than two standard deviations away from the mean must be no more than $1/4 = 0.25$. Notice that this bound can be fairly wide. If $X$ has a normal distribution, we know that about 5% of draws will be greater than 2 SDs away from the mean, much lower than the 25% bound implied by Chebyshev's inequality. \n\n## The law of large numbers\n\nWe can now use these inequalities to show how estimators can be consistent for their target quantities of interest. Why are these inequalities helpful for this purpose? Remember that convergence in probability was about the probability of an estimator being far away from a value going to zero. Chebyshev's inequality shows that we can bound these exact probabilities. \n\nThe most famous consistency result has a special name.\n\n::: {#thm-lln}\n\n## Weak Law of Large Numbers\nLet $X_1, \\ldots, X_n$ be i.i.d. draws from a distribution with mean $\\mu = \\E[X_i]$ and variance $\\sigma^2 = \\V[X_i] < \\infty$. Let $\\Xbar_n = \\frac{1}{n} \\sum_{i =1}^n X_i$. Then, $\\Xbar_n \\inprob \\mu$.\n\n:::\n\n::: {.proof}\nRecall that the sample mean is unbiased, so $\\E[\\Xbar_n] = \\mu$ with sampling variance $\\sigma^2/n$. We can then apply Chebyshev to the sample mean to get\n$$\n\\P(|\\Xbar_n - \\mu| \\geq \\delta) \\leq \\frac{\\sigma^2}{n\\delta^2}.\n$$\nAs $n\\rightarrow\\infty$, the right-hand side goes to 0, which means that the left-hand side also must go to 0, which is the definition of $\\Xbar_n$ converging in probability to $\\mu$. 
\n:::\n\nThe weak law of large numbers (WLLN) shows that, under general conditions, the sample mean gets closer to the population mean as $n\\rightarrow\\infty$. A version of this result even holds when the variance of the data is infinite, though that's a situation that most analysts will rarely face. \n\n::: {.callout-note}\n\nThe naming of the \"weak\" law of large numbers seems to imply the existence of a \"strong\" law of large numbers (SLLN), and this is true. The SLLN states that the sample mean converges to the population mean with probability 1. This type of convergence, called **almost sure convergence**, is stronger than convergence in probability, which only says that the probability of the sample mean being close to the population mean converges to 1. While it is nice to know that this stronger form of convergence holds for the sample mean under the same assumptions, it is rare for folks outside of theoretical probability and statistics to rely on almost sure convergence. \n\n:::\n\n\n\n::: {#exm-lln}\n\nIt can be helpful to see how the distribution of the sample mean changes as a function of the sample size to appreciate the WLLN. We can show this by taking repeated iid samples of different sizes from an exponential random variable with rate parameter 0.5 so that $\\E[X_i] = 2$. In @fig-lln-sim, we show the distribution of the sample mean (across repeated samples) when the sample size is 15 (black), 30 (violet), 100 (blue), and 1000 (green). We can see how the sample mean distribution is \"collapsing\" on the true population mean, 2. The probability of being far away from 2 becomes progressively smaller. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sampling distribution of the sample mean as a function of sample size.](03_asymptotics_files/figure-pdf/fig-lln-sim-1.pdf){#fig-lln-sim}\n:::\n:::\n\n\n\n\n:::\n\n\nThe WLLN also holds for random vectors in addition to random variables. 
Let $(\\X_1, \\ldots, \\X_n)$ be an iid sample of random vectors of length $k$, $\\mb{X}_i = (X_{i1}, \\ldots, X_{ik})$. We can define the vector sample mean as just the vector of sample means for each of the entries:\n\n$$\n\\overline{\\mb{X}}_n = \\frac{1}{n} \\sum_{i=1}^n \\mb{X}_i =\n\\begin{pmatrix}\n\\Xbar_{n,1} \\\\ \\Xbar_{n,2} \\\\ \\vdots \\\\ \\Xbar_{n, k}\n\\end{pmatrix}\n$$\nSince this is just a vector of sample means, each random variable in the random vector will converge in probability to the mean of that random variable. Fortunately, this is the exact definition of convergence in probability for random vectors. We formally write this in the following theorem.\n\n::: {#thm-vector-wlln}\n\nIf $\\X_i \\in \\mathbb{R}^k$ are iid draws from a distribution with $\\E|X_{ij}| < \\infty$ for all $j=1,\\ldots,k$, then as $n\\rightarrow\\infty$, \n\n$$\n\\overline{\\mb{X}}_n \\inprob \\E[\\X] =\n\\begin{pmatrix}\n\\E[X_{i1}] \\\\ \\E[X_{i2}] \\\\ \\vdots \\\\ \\E[X_{ik}]\n\\end{pmatrix}.\n$$\n:::\n\n\n::: {.callout-note}\n\n## Notation alert\n\nYou will have noticed that many of the formal results we have presented so far have \"moment conditions\" requiring that certain moments are finite. For the vector WLLN, we saw this applied to the mean of each variable in the vector. Some books use a shorthand for this: $\\E\\Vert \\X_i\\Vert < \\infty$, where\n$$\n\\Vert\\X_i\\Vert = \\left(X_{i1}^2 + X_{i2}^2 + \\ldots + X_{ik}^2\\right)^{1/2}. \n$$\nThis expression has slightly more compact notation, but why does it work? One can show that this function, called the **Euclidean norm** or $L_2$-norm, is a **convex** function, so we can apply Jensen's inequality to show that:\n$$\n\\E\\Vert \\X_i\\Vert \\geq \\Vert \\E[\\X_i] \\Vert = (\\E[X_{i1}]^2 + \\ldots + \\E[X_{ik}]^2)^{1/2}.\n$$\nSo if $\\E\\Vert \\X_i\\Vert$ is finite, all the component means are finite. Otherwise, the right-hand side of the previous equation would be infinite. 
\n:::\n\n\n\n## Consistency of estimators\n\n\nThe WLLN shows that the sample mean of iid draws is consistent for the population mean, which is a massive result given that so many estimators are sample means of potentially complicated functions of the data. What about other estimators? The proof of the WLLN points to one way to determine if an estimator is consistent: if it is unbiased and the sampling variance shrinks as the sample size grows. \n\n\n::: {#thm-consis}\n\nFor any estimator $\\widehat{\\theta}_n$, if $\\text{bias}[\\widehat{\\theta}_n] = 0$ and $\\V[\\widehat{\\theta}_n] \\rightarrow 0$ as $n\\rightarrow \\infty$, then $\\widehat{\\theta}_n$ is consistent.\n\n:::\n\nThus, for an unbiased estimator, if we can characterize its sampling variance, we should be able to tell if it is consistent. This result is handy since working with the probability statements used for the WLLN can sometimes be quite confusing. \n\nWhat about biased estimators? Consider a plug-in estimator like $\\widehat{\\alpha} = \\log(\\Xbar_n)$ where $X_1, \\ldots, X_n$ are iid from a population with mean $\\mu$. We know that for nonlinear functions like logarithms we have $\\log\\left(\\E[Z]\\right) \\neq \\E[\\log(Z)]$, so $\\E[\\widehat{\\alpha}] \\neq \\log(\\E[\\Xbar_n])$ and the plug-in estimator will be biased for $\\log(\\mu)$. It will also be difficult to obtain an expression for the bias in terms of $n$. Is all hope lost here? Must we give up on consistency? No, and in fact, consistency will be much simpler to show in this setting. \n\n\n::: {#thm-inprob-properties}\n\n## Properties of convergence in probability\n\nLet $X_n$ and $Z_n$ be two sequences of random variables such that $X_n \\inprob a$ and $Z_n \\inprob b$, and let $g(\\cdot)$ be a continuous function. Then, \n\n1. $g(X_n) \\inprob g(a)$ (continuous mapping theorem)\n2. $X_n + Z_n \\inprob a + b$\n3. $X_nZ_n \\inprob ab$\n4. 
$X_n/Z_n \\inprob a/b$ if $b \\neq 0$.\n\n:::\n\nWe can now see that many of the nasty problems with expectations and nonlinear functions are made considerably easier with convergence in probability in the asymptotic setting. So while we know that $\\log(\\Xbar_n)$ is biased for $\\log(\\mu)$, we know that it is consistent since $\\log(\\Xbar_n) \\inprob \\log(\\mu)$ because $\\log$ is a continuous function. \n\n::: {#exm-nonresponse}\n\nSuppose we implemented a survey by randomly selecting a sample from the population of size $n$, but not everyone responded to our survey. Let the data consist of pairs of random variables, $(Y_1, R_1), \\ldots, (Y_n, R_n)$, where $Y_i$ is the question of interest and $R_i$ is a binary indicator for whether the respondent answered the question ($R_i = 1$) or not ($R_i = 0$). Our goal is to estimate the mean of the question for responders: $\\E[Y_i \\mid R_i = 1]$. We can use the law of iterated expectations to obtain\n$$\n\\begin{aligned}\n\\E[Y_iR_i] &= \\E[Y_i \\mid R_i = 1]\\P(R_i = 1) + \\E[ 0 \\mid R_i = 0]\\P(R_i = 0) \\\\\n\\implies \\E[Y_i \\mid R_i = 1] &= \\frac{\\E[Y_iR_i]}{\\P(R_i = 1)}.\n\\end{aligned}\n$$\n\nThe relevant estimator for this quantity is the mean of the outcome among those who responded, which is slightly more complicated than a typical sample mean because the denominator is a random variable:\n$$\n\\widehat{\\theta}_n = \\frac{\\sum_{i=1}^n Y_iR_i}{\\sum_{i=1}^n R_i}. \n$$\nNotice that this estimator is the ratio of two random variables. The numerator has mean $n\\E[Y_iR_i]$ and the denominator has mean $n\\P(R_i = 1)$. It is then tempting to say that we can take the ratio of these means as the mean of $\\widehat{\\theta}_n$, but expectations are not preserved in nonlinear functions like this. 
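That last warning, that expectations do not pass through ratios, is easy to verify with a one-line simulation (the uniform denominator here is an assumed toy example, not the survey setting): $\\E[1/B] \\neq 1/\\E[B]$.

```r
## Sketch: E[1/B] != 1/E[B] for B ~ Uniform(1, 3), where E[B] = 2
set.seed(12)
b <- runif(1e6, min = 1, max = 3)
mean(1 / b)   # approaches log(3)/2 = 0.549..., not 1/E[B] = 0.5
```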
\n\nWe can establish consistency of our estimator, though, by noting that we can rewrite the estimator as a ratio of sample means\n$$\n\\widehat{\\theta}_n = \\frac{(1/n)\\sum_{i=1}^n Y_iR_i}{(1/n)\\sum_{i=1}^n R_i},\n$$\nwhere by the WLLN the numerator $(1/n)\\sum_{i=1}^n Y_iR_i \\inprob \\E[Y_iR_i]$ and the denominator $(1/n)\\sum_{i=1}^n R_i \\inprob \\P(R_i = 1)$. Thus, by @thm-inprob-properties, we have \n$$\n\\widehat{\\theta}_n = \\frac{(1/n)\\sum_{i=1}^n Y_iR_i}{(1/n)\\sum_{i=1}^n R_i} \\inprob \\frac{\\E[Y_iR_i]}{\\P(R_i = 1)} = \\E[Y_i \\mid R_i = 1],\n$$\nso long as the probability of responding is greater than zero. This establishes that our sample mean among responders, while biased for the conditional expectation among responders, is consistent for that quantity. \n:::\n\n\nKeeping the difference between unbiased and consistent clear in your mind is essential. You can easily create ridiculous unbiased estimators that are inconsistent. Let's return to our iid sample, $X_1, \\ldots, X_n$, from a population with $\\E[X_i] = \\mu$. There is nothing in the rule book against defining an estimator $\\widehat{\\theta}_{first} = X_1$ that uses the first observation as the estimate. This estimator is silly, but it is unbiased since $\\E[\\widehat{\\theta}_{first}] = \\E[X_1] = \\mu$. It is inconsistent since the sampling variance of this estimator is just the variance of the population distribution, $\\V[\\widehat{\\theta}_{first}] = \\V[X_i] = \\sigma^2$, which does not change as a function of the sample size. Generally speaking, we can regard \"unbiased but inconsistent\" estimators as silly and not worth our time (along with biased and inconsistent estimators). \n\nSome estimators are biased but consistent, and these are often much more interesting. We already saw one such estimator in @exm-nonresponse, but there are many more. 
Maximum likelihood estimators, for example, are (under some regularity conditions) consistent for the parameters of a parametric model but are often biased. \n\nTo study these estimators, we can broaden @thm-consis to the class of **asymptotically unbiased** estimators that have bias that vanishes as the sample size grows. \n\n::: {#thm-consis-2}\n\nFor any estimator $\\widehat{\\theta}_n$, if $\\text{bias}[\\widehat{\\theta}_n] \\to 0$ and $\\V[\\widehat{\\theta}_n] \\rightarrow 0$ as $n\\rightarrow \\infty$, then $\\widehat{\\theta}_n$ is consistent.\n\n:::\n\n\n\n::: {#exm-plug-in-variance}\n\n## Plug-in variance estimator\n\nIn the last chapter, we introduced the plug-in estimator for the population variance, \n$$\n\\widehat{\\sigma}^2 = \\frac{1}{n} \\sum_{i=1}^n (X_i - \\Xbar_n)^2,\n$$\nwhich we will now show is biased but consistent. To see the bias, note that we can rewrite the sum of squared deviations $$\\sum_{i=1}^n (X_i - \\Xbar_n)^2 = \\sum_{i=1}^n X_i^2 - n\\Xbar_n^2. $$\nThen, the expectation of the plug-in estimator is\n$$\n\\begin{aligned}\n\\E[\\widehat{\\sigma}^2] & = \\E\\left[\\frac{1}{n}\\sum_{i=1}^n X_i^2\\right] - \\E[\\Xbar_n^2] \\\\\n&= \\E[X_i^2] - \\frac{1}{n^2}\\sum_{i=1}^n \\sum_{j=1}^n \\E[X_iX_j] \\\\\n&= \\E[X_i^2] - \\frac{1}{n^2}\\sum_{i=1}^n \\E[X_i^2] - \\frac{1}{n^2}\\sum_{i=1}^n \\sum_{j\\neq i} \\underbrace{\\E[X_i]\\E[X_j]}_{\\text{independence}} \\\\\n&= \\E[X_i^2] - \\frac{1}{n}\\E[X_i^2] - \\frac{1}{n^2} n(n-1)\\mu^2 \\\\\n&= \\frac{n-1}{n} \\left(\\E[X_i^2] - \\mu^2\\right) \\\\\n&= \\frac{n-1}{n} \\sigma^2 = \\sigma^2 - \\frac{1}{n}\\sigma^2.\n\\end{aligned}\n$$\nThus, we can see that the bias of the plug-in estimator is $-(1/n)\\sigma^2$, so it slightly underestimates the variance. 
Nicely, though, the bias shrinks as a function of the sample size, so according to @thm-consis-2, it will be consistent so long as the sampling variance of $\\widehat{\\sigma}^2$ shrinks as a function of the sample size, which it does (though we omit that proof here). Of course, simply multiplying this estimator by $n/(n-1)$ will give an unbiased and consistent estimator that is also the typical sample variance estimator. \n\n:::\n\n## Convergence in distribution and the central limit theorem\n\nConvergence in probability and the law of large numbers are beneficial for understanding how our estimators will (or will not) collapse to their estimand as the sample size increases. But what about the shape of the sampling distribution of our estimators? For statistical inference, we would like to be able to make probability statements such as $\\P(a \\leq \\widehat{\\theta}_n \\leq b)$. These statements will be the basis of hypothesis testing and confidence intervals. But to make those types of statements, we need to know the entire distribution of $\\widehat{\\theta}_n$, not just the mean and variance. Luckily, established results will allow us to approximate the sampling distribution of a vast swath of estimators when our sample sizes are large. \n\nWe first need to describe a weaker form of convergence to see how we will develop these approximations.\n\n\n::: {#def-indist}\nLet $X_1,X_2,\\ldots$, be a sequence of r.v.s, and for $n = 1,2, \\ldots$ let $F_n(x)$ be the c.d.f. of $X_n$. Then it is said that $X_1, X_2, \\ldots$ **converges in distribution** to r.v. $X$ with c.d.f. $F(x)$ if\n$$\n\\lim_{n\\rightarrow \\infty} F_n(x) = F(x),\n$$\nfor all values of $x$ for which $F(x)$ is continuous. We write this as $X_n \\indist X$ or sometimes $X_n ⇝ X$. 
\n:::\n\nEssentially, convergence in distribution means that as $n$ gets large, the distribution of $X_n$ becomes more and more similar to the distribution of $X$, which we often call the **asymptotic distribution** of $X_n$ (other names include the **large-sample distribution**). If we know that $X_n \\indist X$, then we can use the distribution of $X$ as an approximation to the distribution of $X_n$, and that approximation can be reasonably accurate. \n\nOne of the most remarkable results in probability and statistics is that a large class of estimators will converge in distribution to one particular family of distributions: the normal. This result is one reason we study the normal so much and why investing in building intuition about it will pay off across many domains of applied work. We call this broad class of results the \"central limit theorem\" (CLT), but it would probably be more accurate to refer to them as \"central limit theorems\" since much of statistics is devoted to showing the result in different settings. We now present the simplest CLT for the sample mean. \n\n\n::: {#thm-clt}\n\n## Central Limit Theorem \nLet $X_1, \\ldots, X_n$ be i.i.d. r.v.s from a distribution with mean $\\mu = \\E[X_i]$ and variance $\\sigma^2 = \\V[X_i]$. Then if $\\E[X_i^2] < \\infty$, we have\n$$\n\\frac{\\Xbar_n - \\mu}{\\sqrt{\\V[\\Xbar_n]}} = \\frac{\\sqrt{n}\\left(\\Xbar_n - \\mu\\right)}{\\sigma} \\indist \\N(0, 1).\n$$\n:::\n\nIn words: the sample mean of a random sample from a population with finite mean and variance will be approximately normally distributed in large samples. Notice how we have not made any assumptions about the distribution of the underlying random variables, $X_i$. They could be binary, event count, continuous, or anything. The CLT is incredibly broadly applicable. \n\n::: {.callout-note}\n\n## Notation alert\n\nWhy do we state the CLT in terms of the sample mean after centering and scaling by its standard error? 
Suppose we don't normalize the sample mean in this way. In that case, it isn't easy to talk about convergence in distribution because we know from the WLLN that $\\Xbar_n \\inprob \\mu$, so in the limit, the distribution of $\\Xbar_n$ is concentrated in a point mass at that value. Normalizing by centering and rescaling ensures that the variance of the resulting quantity will not depend on $n$, so it makes sense to talk about its distribution converging. Sometimes you will see the equivalent result as\n$$\n\\sqrt{n}\\left(\\Xbar_n - \\mu\\right) \\indist \\N(0, \\sigma^2).\n$$\n:::\n\nWe can use this result to state approximations when discussing estimators, such as\n$$\n\\Xbar_n \\overset{a}{\\sim} N(\\mu, \\sigma^2/n),\n$$\nwhere we use $\\overset{a}{\\sim}$ to mean \"approximately distributed as in large samples.\" This approximation allows us to say things like: \"in large samples, we should expect the sample mean to be within $2\\sigma/\\sqrt{n}$ of the true mean in 95% of repeated samples.\" These statements will be essential for hypothesis tests and confidence intervals! Estimators so often follow the CLT that we have an expression for this property.\n\n::: {#def-asymptotically-normal}\n\nAn estimator $\\widehat{\\theta}_n$ is **asymptotically normal** if for some $\\theta$\n$$\n\\sqrt{n}\\left( \\widehat{\\theta}_n - \\theta \\right) \\indist N\\left(0,\\V_{\\theta}\\right).\n$$\n\n:::\n\n::: {#exm-bin-clt}\n\nTo illustrate how the CLT works, we can simulate the sampling distribution of the (normalized) sample mean at different sample sizes. Let $X_1, \\ldots, X_n$ be i.i.d. samples from a Bernoulli with probability of success 0.25. We then draw repeated samples of size $n=30$ and $n=100$ and calculate $\\sqrt{n}(\\Xbar_n - 0.25)/\\sigma$ for each random sample. @fig-clt plots the density of these two sampling distributions along with a standard normal reference. 
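The simulation in this example can be sketched in a few lines of R (the seed and the number of replications are our own choices):

```r
## Normalized sample means sqrt(n) * (xbar - p) / sigma from
## repeated Bernoulli(0.25) samples of size n.
set.seed(1234)
p <- 0.25
sigma <- sqrt(p * (1 - p))
norm_means <- function(n, sims = 1e4) {
  replicate(sims, {
    x <- rbinom(n, size = 1, prob = p)
    sqrt(n) * (mean(x) - p) / sigma
  })
}
z30 <- norm_means(30)
z100 <- norm_means(100)
## By the CLT, both should be approximately standard normal
c(mean(z100), var(z100))
```

Plotting `density(z30)` and `density(z100)` against `dnorm` reproduces the kind of comparison shown in @fig-clt.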
We can see that even at $n=30$, the rough shape of the density looks normal, with spikes and valleys due to the discrete nature of the data (the sample mean can only take on 31 possible values in this case). By $n=100$, the sampling distribution is very close to the true standard normal. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sampling distributions of the normalized sample mean at n=30 and n=100.](03_asymptotics_files/figure-pdf/fig-clt-1.pdf){#fig-clt}\n:::\n:::\n\n\n\n\n:::\n\n\nThere are several properties of convergence in distribution that are helpful to us. \n\n::: {#thm-indist-properties}\n\n## Properties of convergence in distribution\n\nLet $X_n$ be a sequence of random variables $X_1, X_2,\\ldots$ that converges in distribution to some r.v. $X$ and let $Y_n$ be a sequence of random variables $Y_1, Y_2,\\ldots$ that converges in probability to some number, $c$. Then, \n\n1. $g(X_n) \\indist g(X)$ for all continuous functions $g$.\n2. $X_nY_n$ converges in distribution to $cX$.\n3. $X_n + Y_n$ converges in distribution to $X + c$.\n4. $X_n / Y_n$ converges in distribution to $X / c$ if $c \\neq 0$.\n\n:::\n\nWe refer to the last three results as **Slutsky's theorem**. These results are often crucial for determining an estimator's asymptotic distribution. \n\nA critical application of Slutsky's theorem is when we replace the (unknown) population variance in the CLT with an estimate. Recall the definition of the **sample variance** as\n$$\nS_n^2 = \\frac{1}{n-1} \\sum_{i=1}^n (X_i - \\Xbar_n)^2,\n$$\nwith the **sample standard deviation** defined as $S_{n} = \\sqrt{S_{n}^2}$. 
It's easy to show that these are consistent estimators for their respective population parameters\n$$ \nS_{n}^2 \\inprob \\sigma^2 = \\V[X_i], \\qquad S_{n} \\inprob \\sigma,\n$$\nwhich, by Slutsky's theorem, implies that\n$$\n\\frac{\\sqrt{n}\\left(\\Xbar_n - \\mu\\right)}{S_n} \\indist \\N(0, 1).\n$$\nComparing this result to the statement of the CLT, we see that replacing the population variance with a consistent estimate of the variance (or standard deviation) does not affect the asymptotic distribution. \n\n\nLike with the WLLN, the CLT holds for random vectors of sample means, where their centered and scaled versions converge to a multivariate normal distribution with a covariance matrix equal to the covariance matrix of the underlying random vectors of data, $\\X_i$. \n\n::: {#thm-multivariate-clt}\n\nIf $\\mb{X}_i \\in \\mathbb{R}^k$ are i.i.d. and $\\E\\Vert \\mb{X}_i \\Vert^2 < \\infty$, then as $n \\to \\infty$,\n$$\n\\sqrt{n}\\left( \\overline{\\mb{X}}_n - \\mb{\\mu}\\right) \\indist \\N(0, \\mb{\\Sigma}),\n$$\nwhere $\\mb{\\mu} = \\E[\\mb{X}_i]$ and $\\mb{\\Sigma} = \\V[\\mb{X}_i] = \\E\\left[(\\mb{X}_i-\\mb{\\mu})(\\mb{X}_i - \\mb{\\mu})'\\right]$.\n\n:::\n\nNotice that $\\mb{\\mu}$ is the vector of population means for all the random variables in $\\X_i$ and $\\mb{\\Sigma}$ is the variance-covariance matrix for that vector.\n\n\n::: {.callout-note}\n\nAs with the notation alert for the WLLN, we are using shorthand here: $\\E\\Vert \\mb{X}_i \\Vert^2 < \\infty$ implies that $\\E[X_{ij}^2] < \\infty$ for all $j = 1,\\ldots, k$, or equivalently, that each variable in $\\mb{X}_i$ has finite variance. \n\n:::\n\n## Confidence intervals\n\n\nWe now turn to an essential application of the central limit theorem: confidence intervals. \n\nYou have run your experiment and presented your readers with your single best guess about the treatment effect with the difference in sample means. 
You may have also presented the estimated standard error of this estimate to give readers a sense of how variable the estimate is. But none of these approaches answer a fairly compelling question: what range of values of the treatment effect is **plausible** given the data we observe?\n\nA point estimate typically has 0 probability of being the exact true value, but intuitively we hope that the true value is close to this estimate. **Confidence intervals** make this intuition more formal by estimating ranges of values constructed so that a fixed percentage of them contain the actual parameter value. \n\nWe begin with the basic definition of a confidence interval. \n\n::: {#def-coverage}\nA $1-\\alpha$ **confidence interval** for a real-valued parameter $\\theta$ is a pair of statistics $L= L(X_1, \\ldots, X_n)$ and $U = U(X_1, \\ldots, X_n)$ such that $L < U$ for all values of the sample and such that \n$$ \n\\P(L \\leq \\theta \\leq U \\mid \\theta) \\geq 1-\\alpha, \\quad \\forall \\theta \\in \\Theta.\n$$\n:::\n\nWe say that a $1-\\alpha$ confidence interval covers (contains, captures, traps, etc.) the true value at least $100(1-\\alpha)\\%$ of the time, and we refer to $1-\\alpha$ as the **coverage probability** or simply **coverage**. Typical confidence levels include 95% ($\\alpha = 0.05$) and 90% ($\\alpha = 0.1$). \n\nSo a confidence interval is a random interval with a particular guarantee about how often it will contain the true value. It's important to remember what is random and what is fixed in this setup. The interval varies from sample to sample, but the true value of the parameter stays fixed, and the coverage is how often we should expect the interval to contain that true value. The \"repeating my sample over and over again\" analogy can break down very quickly, so it's sometimes helpful to interpret it as giving guarantees across confidence intervals from many different experiments. 
In particular, suppose that a journal publishes 100 quantitative articles annually, each producing a single 95% confidence interval for its quantity of interest. Then, if the confidence intervals are valid, we should expect 95 of those confidence intervals to contain the true value. \n\n\n::: {.callout-warning}\n\nSuppose you calculate a 95% confidence interval, $[0.1, 0.4]$. It's tempting to make a probability statement like $\\P(0.1 \\leq \\theta \\leq 0.4 \\mid \\theta) = 0.95$ or that there's a 95% chance that the parameter is in $[0.1, 0.4]$. But looking at the probability statement, everything on the left-hand side of the conditioning bar is fixed, so the probability either has to be 0 ($\\theta$ is outside the interval) or 1 ($\\theta$ is in the interval). The coverage probability of a confidence interval refers to its status as a pair of random variables, $(L, U)$, not any particular realization of those variables like $(0.1, 0.4)$. As an analogy, consider if we calculated the sample mean as $0.25$ and then tried to say that $0.25$ is unbiased for the population mean. This statement doesn't make sense because unbiasedness refers to how the sample mean varies from sample to sample. \n\n:::\n\n\nIn most cases, we will not be able to derive exact confidence intervals but rather confidence intervals that are **asymptotically valid**, which means that if we write the interval as a function of the sample size, $(L_n, U_n)$, it would have **asymptotic coverage**\n$$\n\\lim_{n\\to\\infty} \\P(L_n \\leq \\theta \\leq U_n) \\geq 1-\\alpha \\quad\\forall\\theta\\in\\Theta.\n$$\n\nAsymptotic coverage is the property we can show for most confidence intervals since we usually rely on large-sample approximations based on the central limit theorem. 
\n\n\n\n### Deriving confidence intervals\n\nIf you have taken any statistics before, you probably have seen the standard formula for the 95% confidence interval of the sample mean, \n$$ \n\\left[\\Xbar_n - 1.96\\frac{s}{\\sqrt{n}},\\; \\Xbar_n + 1.96\\frac{s}{\\sqrt{n}}\\right],\n$$\nwhere you can recall that $s$ is the sample standard deviation and $s/\\sqrt{n}$ is the estimate of the standard error of the sample mean. If this is a 95% confidence interval, then the probability that it contains the population mean $\\mu$ should be 0.95, but how can we derive this? We can justify this logic using the central limit theorem, and the argument will hold for any asymptotically normal estimator. \n\n\nLet's say that we have an estimator, $\\widehat{\\theta}_n$, for the parameter $\\theta$ with estimated standard error $\\widehat{\\se}[\\widehat{\\theta}_n]$. If the estimator is asymptotically normal, then in large samples, we know that\n$$ \n\\frac{\\widehat{\\theta}_n - \\theta}{\\widehat{\\se}[\\widehat{\\theta}_n]} \\sim \\N(0, 1).\n$$\nWe can then use our knowledge of the standard normal and the empirical rule to find\n$$ \n\\P\\left( -1.96 \\leq \\frac{\\widehat{\\theta}_n - \\theta}{\\widehat{\\se}[\\widehat{\\theta}_n]} \\leq 1.96\\right) = 0.95,\n$$\nand by multiplying each part of the inequality by $\\widehat{\\se}[\\widehat{\\theta}_n]$, we get\n$$ \n\\P\\left( -1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n] \\leq \\widehat{\\theta}_n - \\theta \\leq 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n]\\right) = 0.95.\n$$\nWe then subtract $\\widehat{\\theta}_n$ from all parts to get\n$$ \n\\P\\left(-\\widehat{\\theta}_n - 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n] \\leq - \\theta \\leq -\\widehat{\\theta}_n + 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n]\\right) = 0.95,\n$$\nand finally we multiply all parts by $-1$ (flipping the inequalities) to arrive at\n$$ \n\\P\\left(\\widehat{\\theta}_n - 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n] \\leq \\theta \\leq 
\\widehat{\\theta}_n + 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n]\\right) = 0.95.\n$$\nTo connect back to the definition of the confidence interval, we have now shown that the random interval $[L, U]$ where\n$$ \n\\begin{aligned}\n L = L(X_1, \\ldots, X_n) &= \\widehat{\\theta}_n - 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n] \\\\\n U = U(X_1, \\ldots, X_n) &= \\widehat{\\theta}_n + 1.96\\,\\widehat{\\se}[\\widehat{\\theta}_n],\n\\end{aligned}\n$$\nis an asymptotically valid estimator.[^1] Replacing $\\Xbar_n$ for $\\widehat{\\theta}_n$ and $s/\\sqrt{n}$ for $\\widehat{\\se}[\\widehat{\\theta}_n]$ establishes how the standard 95% confidence interval for the sample mean above is asymptotically valid. \n\n[^1]: Implicit in this analysis is that the standard error estimate is consistent. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Critical values for the standard normal.](03_asymptotics_files/figure-pdf/fig-std-normal-1.pdf){#fig-std-normal}\n:::\n:::\n\n\n\n\nHow can we generalize this to $1-\\alpha$ confidence intervals? For a standard normal rv, $Z$, we know that\n$$ \n\\P(-z_{\\alpha/2} \\leq Z \\leq z_{\\alpha/2}) = 1-\\alpha\n$$\nwhich implies that we can obtain a $1-\\alpha$ asymptotic confidence intervals by using the interval $[L, U]$, where\n$$ \nL = \\widehat{\\theta}_{n} - z_{\\alpha/2} \\widehat{\\se}[\\widehat{\\theta}_{n}], \\quad U = \\widehat{\\theta}_{n} + z_{\\alpha/2} \\widehat{\\se}[\\widehat{\\theta}_{n}]. \n$$\nThis is sometimes shortened to $\\widehat{\\theta}_n \\pm z_{\\alpha/2} \\widehat{\\se}[\\widehat{\\theta}_{n}]$. 
Remember that we can obtain the values of $z_{\\alpha/2}$ easily from R:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## alpha = 0.1 for 90% CI\nqnorm(0.1 / 2, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.644854\n```\n:::\n:::\n\n\n\nAs a concrete example, then, we could derive a 90% asymptotic confidence interval for the sample mean as \n$$ \n\\left[\\Xbar_{n} - 1.64 \\frac{\\widehat{\\sigma}}{\\sqrt{n}}, \\Xbar_{n} + 1.64 \\frac{\\widehat{\\sigma}}{\\sqrt{n}}\\right].\n$$\n\n\n### Interpreting confidence intervals\n\nRemember that the interpretation of confidence is how the random interval performs over repeated samples. A valid 95% confidence interval is a random interval containing the true value in 95% of samples. Simulating repeated samples helps clarify this. \n\n\n::: {#exm-cis}\n\nSuppose we are taking samples of size $n=500$ of random variables where $X_i \\sim \\N(1, 10)$, and we want to estimate the population mean $\\E[X] = 1$. To do so, we repeat the following steps:\n\n1. Draw a sample of $n=500$ from $\\N(1, 10)$. \n2. Calculate the 95% confidence interval for the sample mean, $\\Xbar_n \\pm 1.96\\,\\widehat{\\sigma}/\\sqrt{n}$. \n3. Plot the intervals along the x-axis and color them blue if they contain the truth (1) and red if not. \n\n@fig-ci-sim shows 100 iterations of these steps. We see that, as expected, most calculated CIs contain the true value. Five of the 100 random samples produce intervals that fail to include 1, a coverage rate of exactly 95%. Of course, this is just one simulation, and a different set of 100 random samples might have produced a slightly different coverage rate. The guarantee of the 95% confidence intervals is that if we were to continue to take these repeated samples, the long-run frequency of intervals covering the truth would approach 0.95. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![95% confidence intervals from 100 random samples. 
Intervals are blue if they contain the truth and red if they do not.](03_asymptotics_files/figure-pdf/fig-ci-sim-1.pdf){#fig-ci-sim}\n:::\n:::\n\n\n\n\n:::\n\n\n## Delta method {#sec-delta-method}\n\nSuppose that we know that an estimator follows the CLT, and so we have \n$$\n\\sqrt{n}\\left(\\widehat{\\theta}_n - \\theta \\right) \\indist \\N(0, V),\n$$\nbut we actually want to estimate $h(\\theta)$, so we use the plug-in estimator, $h(\\widehat{\\theta}_n)$. It seems like we should be able to apply part 1 of @thm-indist-properties, but the CLT established the large-sample distribution of the centered and scaled sequence, $\\sqrt{n}(\\widehat{\\theta}_n - \\theta)$, not of the original estimator itself, which is what we would need to investigate the asymptotic distribution of $h(\\widehat{\\theta}_n)$. We can use a little bit of calculus to get an approximation of the distribution we need. \n\n\n::: {#thm-delta-method}\n\nIf $\\sqrt{n}\\left(\\widehat{\\theta}_n - \\theta\\right) \\indist \\N(0, V)$ and $h(u)$ is continuously differentiable in a neighborhood around $\\theta$, then as $n\\to\\infty$,\n$$\n\\sqrt{n}\\left(h(\\widehat{\\theta}_n) - h(\\theta) \\right) \\indist \\N(0, (h'(\\theta))^2 V).\n$$\n\n:::\n\nUnderstanding what's happening here is useful since it might help give intuition as to when this might go wrong. Why do we focus on continuously differentiable functions, $h()$? These functions can be well-approximated with a line in a neighborhood around a given point like $\\theta$. In @fig-delta, we show this: the tangent line at $\\theta_0$, which has slope $h'(\\theta_0)$, is very similar to $h(\\theta)$ for values close to $\\theta_0$. 
Because of this, we can approximate the difference between $h(\\widehat{\\theta}_n)$ and $h(\\theta_0)$ with what this tangent line would give us:\n$$\n\\underbrace{\\left(h(\\widehat{\\theta}_n) - h(\\theta_0)\\right)}_{\\text{change in } y} \\approx \\underbrace{h'(\\theta_0)}_{\\text{slope}} \\underbrace{\\left(\\widehat{\\theta}_n - \\theta_0\\right)}_{\\text{change in } x},\n$$\nand then multiplying both sides by $\\sqrt{n}$ gives\n$$\n\\sqrt{n}\\left(h(\\widehat{\\theta}_n) - h(\\theta_0)\\right) \\approx h'(\\theta_0)\\sqrt{n}\\left(\\widehat{\\theta}_n - \\theta_0\\right). \n$$\nThe right-hand side of this approximation converges to $h'(\\theta_0)Z$, where $Z$ is a random variable distributed $\\N(0, V)$. The variance of this quantity will be\n$$\n\\V[h'(\\theta_0)Z] = (h'(\\theta_0))^2\\V[Z] = (h'(\\theta_0))^2V,\n$$\nby the properties of variances. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Linear approximation to nonlinear functions.](03_asymptotics_files/figure-pdf/fig-delta-1.pdf){#fig-delta}\n:::\n:::\n\n\n\n\n\n::: {#exm-log}\n\nLet's return to the i.i.d. sample $X_1, \\ldots, X_n$ with mean $\\mu = \\E[X_i]$ and variance $\\sigma^2 = \\V[X_i]$. From the CLT, we know that $\\sqrt{n}(\\Xbar_n - \\mu) \\indist \\N(0, \\sigma^2)$. Suppose that we want to estimate $\\log(\\mu)$, so we use the plug-in estimator $\\log(\\Xbar_n)$ (assuming that $X_i > 0$ for all $i$ so that we can take the log). What is the asymptotic distribution of this estimator? This is a situation where $\\widehat{\\theta}_n = \\Xbar_n$ and $h(\\mu) = \\log(\\mu)$. From basic calculus, we know that\n$$\nh'(\\mu) = \\frac{\\partial \\log(\\mu)}{\\partial \\mu} = \\frac{1}{\\mu},\n$$\nso applying the delta method, we can determine that \n$$\n\\sqrt{n}\\left(\\log(\\Xbar_n) - \\log(\\mu)\\right) \\indist \\N\\left(0,\\frac{\\sigma^2}{\\mu^2} \\right).\n$$\n\n:::\n\n::: {#exm-exp}\n\nWhat about estimating $\\exp(\\mu)$ with $\\exp(\\Xbar_n)$? 
Recall that\n$$\nh'(\\mu) = \\frac{\\partial \\exp(\\mu)}{\\partial \\mu} = \\exp(\\mu),\n$$\nso applying the delta method, we have\n$$\n\\sqrt{n}\\left(\\exp(\\Xbar_n) - \\exp(\\mu)\\right) \\indist \\N(0, \\exp(2\\mu)\\sigma^2),\n$$\nsince $\\exp(\\mu)^2 = \\exp(2\\mu)$. \n\n:::\n\n\nLike all of the results in this chapter, there is a multivariate version of the delta method that is incredibly useful in practical applications. We often will combine two different estimators (or two different estimated parameters) to estimate another quantity. We now let $\\mb{h}(\\mb{\\theta}) = (h_1(\\mb{\\theta}), \\ldots, h_m(\\mb{\\theta}))$ map from $\\mathbb{R}^k \\to \\mathbb{R}^m$ and be continuously differentiable (we make the function bold since it returns an $m$-dimensional vector). It will help us to use more compact matrix notation if we introduce an $m \\times k$ Jacobian matrix of all partial derivatives\n$$\n\\mb{H}(\\mb{\\theta}) = \\mb{\\nabla}_{\\mb{\\theta}}\\mb{h}(\\mb{\\theta}) = \\begin{pmatrix}\n \\frac{\\partial h_1(\\mb{\\theta})}{\\partial \\theta_1} & \\frac{\\partial h_1(\\mb{\\theta})}{\\partial \\theta_2} & \\cdots & \\frac{\\partial h_1(\\mb{\\theta})}{\\partial \\theta_k} \\\\\n \\frac{\\partial h_2(\\mb{\\theta})}{\\partial \\theta_1} & \\frac{\\partial h_2(\\mb{\\theta})}{\\partial \\theta_2} & \\cdots & \\frac{\\partial h_2(\\mb{\\theta})}{\\partial \\theta_k} \\\\\n \\vdots & \\vdots & \\ddots & \\vdots \\\\\n \\frac{\\partial h_m(\\mb{\\theta})}{\\partial \\theta_1} & \\frac{\\partial h_m(\\mb{\\theta})}{\\partial \\theta_2} & \\cdots & \\frac{\\partial h_m(\\mb{\\theta})}{\\partial \\theta_k} \n\\end{pmatrix},\n$$\nwhich we can use to generate the equivalent multivariate linear approximation\n$$\n\\left(\\mb{h}(\\widehat{\\mb{\\theta}}_n) - \\mb{h}(\\mb{\\theta}_0)\\right) \\approx \\mb{H}(\\mb{\\theta}_0)\\left(\\widehat{\\mb{\\theta}}_n - \\mb{\\theta}_0\\right).\n$$\nWe can use this fact to derive the multivariate delta method.\n\n::: 
{#thm-multivariate-delta}\n\nIf $\\sqrt{n}\\left(\\widehat{\\mb{\\theta}}_n - \\mb{\\theta}_0 \\right) \\indist \\N(0, \\mb{\\Sigma})$, then for any function $\\mb{h}$ that is continuously differentiable in a neighborhood of $\\mb{\\theta}_0$, we have\n$$\n\\sqrt{n}\\left(\\mb{h}(\\widehat{\\mb{\\theta}}_n) - \\mb{h}(\\mb{\\theta}_0) \\right) \\indist \\N(0, \\mb{H}\\mb{\\Sigma}\\mb{H}'), \n$$\nwhere $\\mb{H} = \\mb{H}(\\mb{\\theta}_0)$.\n:::\n\n\nThis result follows from the approximation above plus rules about variances of random vectors. Remember that for any compatible matrix of constants, $\\mb{A}$, we have $\\V[\\mb{A}\\mb{Z}] = \\mb{A}\\V[\\mb{Z}]\\mb{A}'$. You can see that the matrix of constants appears twice here, like the matrix version of the \"squaring the constant\" rule for variance. \n\nThe delta method is handy for generating closed-form approximations for asymptotic standard errors, but the math is often quite complex for even simple estimators. It is usually more straightforward for applied researchers to use computational tools like the bootstrap to approximate the standard errors we need. The bootstrap has the trade-off of taking more computational time to implement than the delta method. Still, it is more easily adaptable across different estimators and domains with little human thinking time. 
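As a check on the delta method, we can compare its implied standard error for $\log(\Xbar_n)$ from @exm-log with a direct simulation (the distribution, sample size, and seed here are our own choices):

```r
## Delta method: sqrt(n)(log(xbar) - log(mu)) ~ N(0, sigma^2 / mu^2),
## so se[log(xbar)] is approximately sigma / (mu * sqrt(n)).
set.seed(1234)
n <- 1000
mu <- 5
sigma <- 2
delta_se <- sigma / (mu * sqrt(n))
log_means <- replicate(1e4, log(mean(rnorm(n, mean = mu, sd = sigma))))
c(delta = delta_se, simulated = sd(log_means))  # both around 0.0126
```

The closed-form and simulated standard errors agree closely, which is exactly the kind of agreement the bootstrap exploits when closed-form delta-method calculations become unwieldy.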
\n", "supporting": [ "03_asymptotics_files/figure-pdf" ], diff --git a/_freeze/03_asymptotics/figure-pdf/fig-ci-sim-1.pdf b/_freeze/03_asymptotics/figure-pdf/fig-ci-sim-1.pdf index d1749c45da60d3bc70cffa82207a1fe35ab98109..02adcb8ceaf2a0076727541f315b1398a7acae4e 100644 GIT binary patch delta 149 zcmbPYIK^v=VH6g;rrn&|u>IMetnq2z6`6(`mC8-J; zE>=bcM#hE)5G9*a#V#^BTR59J85)@xx>!28m{~ZvIvcn-S(q4Fm>4-48JRje+bP%( KQZl(xLK*-^b|m@$ delta 149 zcmbPYIK^v=VH6g;rrn&|u>IMetnq2z6`6(`mC8-J; zE>=bcM#hE)5G9*a#V#^B8yK6L8JnA%7&{u7y1E!RyINXUx|o=oo4FcW7+Dy)+9}u& KQZl(xLK*-r?Ief* diff --git a/_freeze/03_asymptotics/figure-pdf/fig-clt-1.pdf b/_freeze/03_asymptotics/figure-pdf/fig-clt-1.pdf index 1e78a5c452f6931942734b01803f430d2318140a..5f82ec4e3578a90e2f456b6a6c33c5a45e38e1a0 100644 GIT binary patch delta 153 zcmdlKvnggng&doanUR5^#pF6UF(`BMEIE5#4v4U^$>fLH(s0gZ7M=5q@rF*0ZYBms zCPs#421YKH#s=mFPKHKC&Q6AAP8J5H26hTI1eL@p*x7Lvmn0UIR1~GAaTyw#7+7+t Ks=E5SaRC6q1|^UH delta 153 zcmdlKvnggng&doKnW3qH$>cgYF(`BMEIE5#4v4U^>Ewsn(s0gZ7M=5q@h*nW=4M7_ zrcMS%CPpS^&gSOkZqBYwj>Z<2hNiB@ZgvVb1eL@p*x7Lvmn0UIR1~GAaTyw#7+7+t Ks=E5SaRC73b|yps diff --git a/_freeze/03_asymptotics/figure-pdf/fig-delta-1.pdf b/_freeze/03_asymptotics/figure-pdf/fig-delta-1.pdf index 2e3dc5914605476a944996a9a790478b247950bd..f4376335a62c1fa9c8c9f994c3dcacbd3a6a3530 100644 GIT binary patch delta 168 zcmdm^zej(=0(K=MGa~~-3u9AFE`8tp6qm%3R0RzeDIMetxRo25Zmt%6%;;?H=;&(c}FzS=3?$>VPfoLVC?K<>}X=AU_(gB IGec7Y6Jt|NE`8tp6qm%3R0RzeDbR8~n{BQZeaz_WVrJxQ=3?gNXklq$YHVm|Z0KfTY3%4~Xy9b%>}ujfFN(f}kFDTe?6 diff --git a/_freeze/03_asymptotics/figure-pdf/fig-lln-sim-1.pdf b/_freeze/03_asymptotics/figure-pdf/fig-lln-sim-1.pdf index ee45566a8fe96dec2a9a0224193ac493e01197d8..1503982e5b7a0b04dd16cff7bbd7060b1cc3f201 100644 GIT binary patch delta 12109 zcmaia1yCJ9w`Op6OK^ABi@WQ^-2x;Kg1Zg5xH|+1?j9hx69^8$gS)$3oaKM@wrXo% zZS73e%y&*7`KIU0Oix$e`u+rJ31H#)`2>^HIH`duf8`p7A(xqUpr5=q@We-NZf2Lt z0Cs4ZZoRUx$ooe|ItPP}j{pEPe&WyuW*pIlCz})JaK6!AUoJemJUmxGAXt 
zFu+goDP5d;N7rC;`VKl9;MrS-AjhxIx5i!`Lx5-Aemy^R+Kd*Zbs%kJ=qK zhce8;=NEAT4ve~Tw>kdUyXQfxP!i%F0f2@lY&8~SUg_0+I4@Jot@N86q_AyFKK_H~rRx`byrlFNf zHi-N=m#!vojr@_Gf4>q-n=XFrL8F(QC)z&lhv@+HU0Eq5>{xSY6mbor+K(n0X+F}E zXz%mc%ucTg>rswy?(<*q9o_lq(9i(!K^S!*IKf^!DyE$BS|NpLxPqyVR@{{z3ybhMh6KMq_;+?xDtC7go_4H5=2q za--?|g$-=FJ7Y!1<-?g(3F%zUUKI(fi>g!o8;&IjW#?Mx;_s7PC6!MnaHUAuF*{OINdk>GkV`~eCp{++rY$NLmFot z#P9rt-zu4$>Ddfja`5s~s@wEoWnV9>j^rj%#dCa^o$1L}@IEa$H5!zvl(vF5FfxV+ zjP8RO*!(ws5XLl|1uFpnG6r{I#o<1y$pi%dA~W*m*vxbN{6*s-6NXG2?UZG@-H}5q zOT&M9U?GVFqAsh@wyHZd9d~z{~=agvo+-RW?x7`F(KUOCYKsGy0QN#Yi&sPFL)@1q``1x7$FBmG^ zZ;l7sZjzoDw(1u8;r`aWwKBK^zp1#zYXgl9H=Ln!sMG`5{CgEwH8Ti0Hrq^Y*Qkuq zAnff)CE){r>VWRbAw?kddesiAdsF6!>Um_{j^7XjZUn)d=wdO&TsovYfMKC**W=IzS8CRN4qz6w5JkGf*7G&a9Y(Z zDE~_#aBoyfvR33zgJj1dNo&(BH%#cTh>{KPo1tjvQmwsP&y_QYaPC6b+w6<-A))|P zlqEpg8Xy`?u4Fh`va%E$g^$uZ^g(Y;_Ajo119dLp2azv*AL06S-V@Go_%3buOIRL* zPQ_a*iLOO%3f(y=C}K6iL6F*BSqAahuF?>+-i9<0mOF;ThNzV#VOG+4Y{dkXzGv6~hb(bp~0H<-ccY7KRdwDG9$9T#^)svpti|x$((^X5yllAQ0Wq_;?Y)x(WEq z-NQwicY@B@ja9v~<7;@eevhol6GTRoK$I%Md|eq1PXuvljUY=Izxv zdnxM?3a}G>k0acRtm?Tgk6cpf7YC_sDI(pshnu2x3@J-6eAw`t`Q!v>&&N9IhGgGo zHYY|$BScFQ9C(9dC;#|cl6r{Eceq5bWw%>m=o^O`dszPrkW%~1V)o&H`A6h#ukNov z8FbiKyN9s$tVxRxcz?uA!vfy;Z3nNCQ;92owzBQX5hDN@l${l^oDwbV_#CBl#>VmK9)B{x!$_m1E@ zcjSg#I=i)w0Y|sl!}bE=o@VqrEe;T%SS(}r$C3!rZd0HY0Es|aHlzv(8|}9N9=8s0 zeRL=Hg1)IKDp(}}STu`+oW@j!WG4}{K1E|{n8Oeos!6Zi*XJNn?BT=(Kq;HYrax-MfWw%PjX)8 zfd|$-Zxt^+6@9-r1tK;K@j7~(VAM5F7}`b)zGhd4yb~YMd6~=?m^JRluqZ+mQ5-Ey z@Ask&dy7j*Zw9T!KR1L}j_3p%nm2N*#W1?@;uVwfhSo%wHN9VFK7*yS2+U?p_d=FA zT}Hvd->LM6nY0jyfhq$>hQ#_sS$Ihxr$AvOIV8GmZv=Vg4&D+XYU1EF^9yX^?!uw7O%XQ_j=q9w4silA+4HxsO=;j2FfPMjVgYNHZq<(SnGUX=H0gdqOBQPa zt#Wrcp|eGl+V0yztl;kXyNmwAw%JM4^a`-25`%6nB8xTRbTt8I#9!~PzFloqX@Mib z)`-RLih@SnlseVKd?e5)9^Uz3XJqoKj`dSNs8UW^#d8Xx;wZ!DW(NPn^l`rAjmkk~` zWUBrZtPjptvin<~r2&V&r`I`9d^Gx*zmNTXLf{1KGvK9ks-J*zeOq%GT<9)bU^4Rg zb>m4r;m^>@oV1ZomX;Nqc<9UTa9uchuyiBh_K`G1XMhfZ>^tryuKgx8HfM<)4?r>$RY( 
zNyT{vOe)8NOG`g*rN87;lkU7&er2|pd`;t1^HFSKDhL5@1ij$cqbWqx?p&n$$!Sw|{rq^Uj2W#>q?=wpuywIQy;RV?G%SHKI zXxKz5>P&w7898+<>WP(XY-v;(nC0$^den9G>dXL3$Sk?!FA4Q#ZOXAtsIatKb1>qv z#-Tt%aC2i+Se~VjQob|S=_u!m5CWSQN~nJUl1U&#JL(DAvUnrEl#v!^V95g_u4|-$ zD0(u4mL>0Jsz*G=SaxNZ_!OAg!n|9P-{NN~UOmP!?cu(m7^i?Cc{muAS5BRvRNNez z>QHLr+2?@XDxn+c2lchXzdu3R@;2rTJJM#CXEe4-rab-wAL5#BXxuLiLLuqVG(&!n z^zHw89O-VSVLJ&oUUrrjJO>fIrZJm!UB5P+l~T!v6oLC34&CTk(f%6gXSqosEOT|pgfWv}=gLGbccr+N<9a&8 zow-RFy40?icOk-IHj1&1a%&hNPtAjJD&K)k zM%||PQIoT1%wz?r6N3tX(iJ@Z4+DaqDR+-Goy8F}Sp`d*1+Tr4>;$fmr?gMagmGA- zZUhJxUPPED(48(tF4%O2w6VzvZPdMfNp#FI+MwLeDV1QwN!pq3hamMFtidD(&#qho zLP_oXwAw#6eQ&}Ukr-FPub5C1524^`nBCKTYs{o-l_NPcgD2o$u3nTbfuboZ1h zgorcDsWg*Bpvw|}!o0Yk`a-;7zs948q4c4>AKuB<9CRCBHAY!p5>A1iAsAtq6+nj} zM3GR%rT)TP1gz##taSW?2(aIWsz+(WQ91&%2M-Ch4Ya}S#kHtb@z{d$KDy{xC_iN} zF_o?jQnFkJH9F$ct8$rv81tHZ(0+#*s2aEYlvTW-f7ZK}sPgg!Jw|Jz1E^62bC~Cj zf#&{Jy1mHvp}}0>cBW(-h0rt_U&cGdP-e7bM0wF)O%qJRne9$0uG-~31hYmucmjs; zZb;JOIZ|K`xuee~qF2<$&qkl07lN}L`_yfbB{w-G;=gFec*(U7fUc*9hkgptf)!#H z6}GNHo4(#%!v3Y#rDps2J2dxR~JwX1jnc0VceC)q;J!K19DoJZz5D%LYOrVJEeYi*abAH-TQCAIOV!Q$13jAb`b)e z-_@K~D!vFIbTnRb03xIwhCJ`ik*7X~JX3um9}nnub^gs}|;Qh@j?}UdRx2&0p!n7YD>cxn@fCl++ zn`E^Q`MuF1tR3abMr~RKAoo57dNCddn2vEQOp~4v9vvbM_;(AKxQM(Hh+wXiCi1Y&JkSiQ%eHhb55yS z|43#kX(KZrKOPknz8+fpxHDGE-?LQfEPx1>v6q0!rQ=MzY^J+9L3T_7@r(g8VH{d zgLfgS?;um$(Slts$<%hjjsEfwLGtf6rGlFjq#9;rX2D{+!Ye^m6h4k^5I4V26R1TF zZotu{_ZPbqog*zePzNr6sdiM^P(bDRB8q&NCWe$RGE395QdNKes0)5jClkDUgcn$Q zhjZMa6j$igk6q1C*exqnOMD-U{bX=rP*t-)u3vjb_>2(clC_#>Iwv#lQXJ!-<#n=@ zBn)eU=gn+fuaEW{bDc@Xj}Y%ovxdg0^Cf&zz})QBSoFYSjTG!1vTzX0ybfk90LPHJ zX^KF8jyu}B?!qZ}AcApu>!_Rt$_RcB?}CP(t3FRAh#7A3ai`)d?jTDkRNc6!Sc;`mj(2$!{C`uLx9aZZ7=JFocHjQUt#k5noKMV?^#_fvASmW|<}Ocd z(pQDF&WNq<&p(zjUnpaBbtZY#Nc`Z^$QrUs?(OHUQ9nESuQ7NbWWGNWWaYLIDN#Ob zmtHuj8P5x^g`Z&o7d3!#{}iwxj)i)>6a9@lkbPs5IK!eg1Oo_)0}8e@E!qGT290ul z_(!tWg47R`aKK^{fw02DuGj5)SSh9aF}|o3*%y-Xh>ilWe>TDebdG$y3|vz70<-8z zj60gd3`;qB{Pza%jsj(hngD$ByTp%U+0cf^jKn#nZ#0+S#q0?-n#oROvl9J@o9htM 
diff --git a/_freeze/03_asymptotics/figure-pdf/fig-std-normal-1.pdf b/_freeze/03_asymptotics/figure-pdf/fig-std-normal-1.pdf index 4a8cb64ba7c3ce228ba4bd5d11267f55ab3c22bd..5c9cb85a1ccceb27844eb31bca7fd71c490763c0 100644 GIT binary patch
diff --git a/_freeze/04_hypothesis_tests/execute-results/html.json b/_freeze/04_hypothesis_tests/execute-results/html.json index ee9dd63..a8e322b 100644 --- a/_freeze/04_hypothesis_tests/execute-results/html.json +++ b/_freeze/04_hypothesis_tests/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "55f1ee4b011ed2824a645cec73fb7784", + "hash": "a2384cf9d1d8a813b5e24a487bf33def", "result": { - "markdown": "# Hypothesis tests\n\n\nUp to now, we have discussed the properties of estimators that allow us to characterize their distributions in finite and large samples. These properties might let us say that, for example, our estimated difference in means is equal to a true average treatment effect on average across repeated samples or that it will converge to the true value in large samples. These properties, however, are properties of repeated samples. As researchers, we will only have access to a single sample. **Statistical inference** is the process of using our single sample to learn about population parameters. Several ways to conduct inference are connected, but one of the most ubiquitous in the sciences is the hypothesis test, which is a kind of statistical thought experiment. \n\n\n\n\n## The lady tasting tea\n\nThe story of the lady tasting tea, due to R.A. Fisher, exemplifies the core ideas behind hypothesis testing.[^1] Fisher had prepared tea for his colleague, the algologist Muriel Bristol. Knowing that she preferred milk in her tea, he poured milk into a tea cup and then poured the hot tea into the milk. Bristol rejected the cup, stating that she preferred pouring the tea first, then milk. Fisher was skeptical of the idea that anyone could tell the difference between a cup poured milk-first or tea-first. So he and another colleague, William Roach, devised a test to see if Bristol could distinguish the two preparation methods. \n\nFisher and Roach prepared 8 cups of tea, four milk-first and four tea-first. 
They then presented the cups to Bristol in a random order (though she knew there were 4 of each type), and she proceeded to identify all of the cups correctly. At first glance, this seems like good evidence that she can tell the difference between the two types, but a skeptic like Fisher raised the question: \"could she have just been randomly guessing and got lucky?\" This led Fisher to a **statistical thought experiment**: what would the probability of guessing the correct cups be *if* she were guessing randomly?\n\nTo calculate the probability of Bristol's achievement, we can note that \"randomly guessing\" here would mean that she was selecting a group of 4 cups to be labeled milk-first from the 8 cups available. Using basic combinatorics, we can calculate there are 70 ways to choose 4 cups among 8, but only 1 of those arrangements would be correct. Thus, if randomly guessing means choosing among those 70 options with equal chance, then the probability of guessing the right set of cups is 1/70 or $\\approx 0.014$. The low probability implies that the hypothesis of random guessing may be implausible. \n\nThe story of the lady tasting tea encapsulates many of the core elements of hypothesis testing. Hypothesis testing is about taking our observed estimate (Bristol guessing all the cups correctly) and seeing how likely that observed estimate would be under some assumption or hypothesis about the data-generating process (Bristol was randomly guessing). When the observed estimate is unlikely under the maintained hypothesis, we might view this as evidence against that hypothesis. Thus, hypothesis tests help us assess evidence for particular guesses about the DGP. \n\n\n[^1]: The analysis here largely comes from @Senn12. \n\n\n::: {.callout-note}\n\n## Notation alert\n\nFor the rest of this chapter, we'll introduce the concepts following the notation in the past chapters. 
We'll usually assume that we have a random (iid) sample of random variables $X_1, \\ldots, X_n$ from a distribution, $F$. We'll focus on estimating some parameter, $\\theta$ of this distribution (like the mean, median, variance, etc.). We'll refer to $\\Theta$ as the set of possible values of $\\theta$ or the **parameter space**.\n\n:::\n\n## Hypotheses\n\nIn the context of hypothesis testing, hypotheses are just statements about the population distribution. In particular, we will make statements that $\\theta = \\theta_0$ where $\\theta_0 \\in \\Theta$ is the hypothesized value of $\\theta$. Hypotheses are ubiquitous in empirical work, but here are some examples to give you a flavor:\n\n- The population proportion of US citizens that identify as Democrats is 0.33. \n- The population difference in average voter turnout between households who received get-out-the-vote mailers vs. those who did not is 0. \n- The difference in the average incidence of human rights abuse in countries that signed a human rights treaty vs. those countries that did not sign is 0. \n\nEach of these is a statement about the true DGP. The latter two are very common: when $\\theta$ represents the difference in means between two groups, then $\\theta = 0$ is the hypothesis of no actual difference in population means or no treatment effect (if the causal effect is identified). \n\nThe goal of hypothesis testing is to adjudicate between two complementary hypotheses. \n\n::: {#def-null}\n\nThe two hypotheses in a hypothesis test are called the **null hypothesis** and the **alternative hypothesis**, denoted as $H_0$ and $H_1$, respectively. \n\n:::\n\nThese hypotheses are complementary, so if the null hypothesis $H_0: \\theta \\in \\Theta_0$, then the alternative hypothesis is $H_1: \\theta \\in \\Theta_0^c$. The \"null\" in null hypothesis might seem odd until you realize that most null hypotheses are that there is no effect of some treatment or no difference in means. 
For example, suppose $\\theta$ is the difference in mean support for expanding legal immigration between a treatment group that received a pro-immigrant message and some facts about immigration and a control group that just received the factual information. Then, the typical null hypothesis would be no difference in means or $H_0: \\theta = 0$, and the alternative would be $H_1: \\theta \\neq 0$. \n\nThere are two types of tests that differ in the form of their null and alternative hypotheses. A **two-sided test** is of the form\n$$\nH_0: \\theta = \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta \\neq \\theta_0,\n$$\nwhere the \"two-sided\" part refers to how the alternative contains values of $\\theta$ above and below the null value $\\theta_0$. A **one-sided test** has the form\n$$\nH_0: \\theta \\leq \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta > \\theta_0,\n$$\nor\n$$\nH_0: \\theta \\geq \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta < \\theta_0.\n$$\nTwo-sided tests are much more common in the social sciences, where we want to know if there is any evidence, positive or negative, against the presumption of no treatment effect or no relationship between two variables. One-sided tests are for situations where we only want evidence in one direction, which is rarely relevant to social science research. One-sided tests also have the downside of being misused to inflate the strength of evidence against the null and should be avoided. Unfortunately, the math of two-sided tests is also more complicated. \n\n## The procedure of hypothesis testing\n\nAt the most basic level, a **hypothesis test** is a rule that specifies values of the sample data for which we will decide to **reject** the null hypothesis. Let $\\mathcal{X}_n$ be the range of the sample---that is, all possible vectors $(x_1, \\ldots, x_n)$ that have positive probability of occurring. 
Then, a hypothesis test describes a region of this space, $R \subset \mathcal{X}_n$, called the **rejection region**, such that when $(X_1, \ldots, X_n) \in R$ we will **reject** $H_0$ and when the data is outside this region, $(X_1, \ldots, X_n) \notin R$, we **retain**, **accept**, or **fail to reject** the null hypothesis.[^2]\n\n[^2]: Different people and different textbooks describe what to do when we do not reject the null hypothesis in different ways. The terminology is not so important so long as you understand that rejecting the null does not mean the null is logically false, and \"accepting\" the null does not mean the null is logically true. \n\n\n\n\nHow do we decide what the rejection region should be? Even though we define the rejection region in terms of the **sample space**, $\mathcal{X}_n$, it's unwieldy to work with the entire vector of data. Instead, we often formulate the rejection region in terms of a **test statistic**, $T = T(X_1, \ldots, X_n)$, where the rejection region becomes\n$$\nR = \left\{(x_1, \ldots, x_n) : T(x_1, \ldots, x_n) > c\right\},\n$$\nwhere $c$ is called the **critical value**. This expression says that the rejection region is the part of the sample space that makes the test statistic sufficiently large. We reject null hypotheses when the observed data is incompatible with those hypotheses, where the test statistic should be a measure of this incompatibility. Note that the test statistic is a random variable and has a distribution---we will exploit this to understand the different properties of a hypothesis test. \n\n\n\n::: {#exm-biden}\n\nSuppose that $(X_1, \ldots, X_n)$ represents a sample of US citizens where $X_i = 1$ indicates support for the current US president and $X_i = 0$ means no support. We might be interested in the test of the null hypothesis that the president does not have the support of a majority of American citizens. Let $\mu = \E[X_i] = \P(X_i = 1)$. 
Then, a one-sided test would compare the two hypotheses:\n$$ \nH_0: \mu \leq 0.5 \quad\text{versus}\quad H_1: \mu > 0.5.\n$$\nIn this case, we might use the sample mean as the test statistic, so that $T(X_1, \ldots, X_n) = \Xbar_n$ and we have to find some threshold above 0.5 such that we would reject the null, \n$$ \nR = \left\{(x_1, \ldots, x_n): \Xbar_n > c\right\}.\n$$\nIn words, how much support should we see for the current president before we reject the notion that they lack majority support? Below we will select the critical value, $c$, to have beneficial statistical properties. \n:::\n\nThe structure of a rejection region will depend on whether a test is one- or two-sided. One-sided tests will take the form $T > c$, whereas two-sided tests will take the form $|T| > c$ since we want to count deviations from either side of the null hypothesis as evidence against that null. \n\n## Testing errors\n\nHypothesis tests end with a decision to reject the null hypothesis or not, but this might be an incorrect decision. In particular, there are two ways to make errors and two ways to be correct in this setting, as shown in @tbl-errors. The labels are confusing, but it's helpful to remember that **type I errors** (said \"type one\") are labeled so because they are the worse of the two types of errors. These errors occur when we reject a null (say there is a true treatment effect or relationship) when the null is true (there is no true treatment effect or relationship). Type I errors are what we see in the replication crisis: lots of \"significant\" effects that turn out later to be null. **Type II errors** (said \"type two\") are considered less problematic: there is a true relationship, but we cannot detect it with our test (we cannot reject the null). 
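The type I error rate can be made concrete with a small Monte Carlo sketch (not part of the original text; the sample size, number of simulations, and the 1.96 cutoff for a two-sided 0.05-level test are illustrative choices). Because the null is true in every simulated sample, every rejection is a false positive:

```r
# Monte Carlo sketch: estimate the type I error rate of a two-sided test.
# Hypothetical setup: samples of size n from N(0, 1), null value mu_0 = 0,
# rejecting when |T| > 1.96 (the conventional two-sided 0.05 cutoff).
set.seed(2138)
n_sims <- 10000
n <- 100
rejections <- replicate(n_sims, {
  x <- rnorm(n)                                # a sample drawn under the null
  t_stat <- (mean(x) - 0) / sqrt(var(x) / n)   # t-statistic against mu_0 = 0
  abs(t_stat) > 1.96                           # TRUE = false positive
})
mean(rejections)  # fraction of false rejections, close to 0.05
```

The fraction of rejections estimates the size of the test, which should land near 0.05.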
\n\n\n| | $H_0$ True | $H_0$ False |\n|--------------|--------------|---------------|\n| Retain $H_0$ | Awesome | Type II error |\n| Reject $H_0$ | Type I error | Great |\n\n: Typology of testing errors {#tbl-errors}\n\n\nIdeally, we would minimize the chances of making either a type I or type II error. Unfortunately, because the test statistic is a random variable, we cannot remove the probability of an error altogether. Instead, we will derive tests with some guaranteed performance to minimize the probability of type I error. To derive this, we can define the **power function** of a test,\n$$ \n\\pi(\\theta) = \\P\\left( \\text{Reject } H_0 \\mid \\theta \\right) = \\P\\left( T \\in R \\mid \\theta \\right),\n$$\nwhich is the probability of rejection as a function of the parameter of interest, $\\theta$. The power function tells us, for example, how likely we are to reject the null of no treatment effect as we vary the actual size of the treatment effect. \n\nWe can define the probability of type I error from the power function. \n\n::: {#def-size}\nThe **size** of a hypothesis test with the null hypothesis $H_0: \\theta = \\theta_0$ is \n$$ \n\\pi(\\theta_0) = \\P\\left( \\text{Reject } H_0 \\mid \\theta_0 \\right).\n$$\n:::\n\nYou can think of the size of a test as the rate of false positives (or false discoveries) produced by the test. @fig-size-power shows an example of rejection regions, size, and power for a one-sided test. In the left panel, we have the distribution of the test statistic under the null, with $H_0: \\theta = \\theta_0$, and the rejection region is defined by values $T > c$. The shaded grey region is the probability of rejection under this null hypothesis or the size of the test. 
Sometimes, we will get extreme samples by random chance, even under the null, leading to false discoveries.[^3]\n\n[^3]: Eagle-eyed readers will notice that the null tested here is a point, while we previously defined the null in a one-sided test as a region $H_0: \\theta \\leq \\theta_0$. Technically, the size of the test will vary based on which of these nulls we pick. In this example, notice that any null to the left of $\\theta_0$ will result in a lower size. And so, the null at the boundary, $\\theta_0$, will maximize the size of the test, making it the most \"conservative\" null to investigate. Technically, we should define the size of a test as $\\alpha = \\sup_{\\theta \\in \\Theta_0} \\pi(\\theta)$. \n\nIn the right panel, we overlay the distribution of the test statistic under one particular alternative, $\\theta = \\theta_1 > \\theta_0$. The red-shaded region is the probability of rejecting the null when this alternative is true or the power---it's the probability of correctly rejecting the null when it is false. Intuitively, we can see that alternatives that produce test statistics closer to the rejection region will have higher power. This makes sense: detecting big deviations from the null should be easier than detecting minor ones. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Size of a test and power against an alternative.](04_hypothesis_tests_files/figure-html/fig-size-power-1.png){#fig-size-power width=672}\n:::\n:::\n\n\n\n@fig-size-power also hints at a tradeoff between size and power. Notice that we could make the size smaller (lower the false positive rate) by increasing the critical value to $c' > c$. This would make the probability of being in the rejection region smaller, $\\P(T > c' \\mid \\theta_0) < \\P(T > c \\mid \\theta_0)$, leading to a lower-sized test. 
Unfortunately, it would also reduce power in the right panel since the probability of being in the rejection region will be lower under any alternative, $\P(T > c' \mid \theta_1) < \P(T > c \mid \theta_1)$. This means we usually cannot simultaneously reduce both types of errors. \n\n## Determining the rejection region\n\n\nIf we cannot simultaneously optimize a test's size and power, how should we determine where the rejection region is? That is, how should we decide what empirical evidence will be strong enough for us to reject the null? The standard approach to this problem in hypothesis testing is to control the size of a test (that is, control the rate of false positives) and try to maximize the power of the test subject to that constraint. So we say, \"I'm willing to accept that at most x% of findings will be false positives\" and do whatever we can to maximize power subject to that constraint. \n\n::: {#def-level}\n\nA test has **significance level** $\alpha$ if its size is less than or equal to $\alpha$, or $\pi(\theta_0) \leq \alpha$.\n\n:::\n\nA test with a significance level of $\alpha = 0.05$ will have a false positive/type I error rate no larger than 0.05. This level is widespread in the social sciences, though you also will see $\alpha = 0.01$ or $\alpha = 0.1$. Frequentists justify this by saying that with $\alpha = 0.05$, at most 5% of studies will produce false discoveries. \n\nOur task is to construct the rejection region so that the **null distribution** of the test statistic $G_0(t) = \P(T \leq t \mid \theta_0)$ has less than $\alpha$ probability in that region. One-sided tests like in @fig-size-power are the easiest to show, even though we warned you not to use them. 
We want to choose $c$ that puts no more than $\alpha$ probability in the tail, or\n$$ \n\P(T > c \mid \theta_0) = 1 - G_0(c) \leq \alpha.\n$$\nRemember that smaller values of $c$ yield higher power, which implies that the critical value with maximum power while maintaining the significance level satisfies $1 - G_0(c) = \alpha$. We can use the **quantile function** of the null distribution to find the exact value of $c$ we need,\n$$\nc = G^{-1}_0(1 - \alpha),\n$$\nwhich is just fancy math to say, \"the value below which $1-\alpha$ of the null distribution falls.\"\n\nThe determination of the rejection region follows the same principles for two-sided tests, but it is slightly more complicated because we reject when the magnitude of the test statistic is large, $|T| > c$. @fig-two-sided shows the basic setup. Notice that because there are two (disjoint) regions, we can write the size (false positive rate) as\n$$ \n\pi(\theta_0) = G_0(-c) + 1 - G_0(c).\n$$\nIn most cases that we will see, the null distribution for such a test will be symmetric around 0 (usually asymptotically standard normal, actually), which means that $G_0(-c) = 1 - G_0(c)$, so the size is\n$$ \n\pi(\theta_0) = 2(1 - G_0(c)).\n$$ \nSolving for the critical value that would make this $\alpha$ gives\n$$ \nc = G^{-1}_0(1 - \alpha/2).\n$$\nAgain, this formula can seem dense, but remember what you are doing: finding the value that puts $\alpha/2$ of the probability of the null distribution in each tail. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Rejection regions for a two-sided test.](04_hypothesis_tests_files/figure-html/fig-two-sided-1.png){#fig-two-sided width=672}\n:::\n:::\n\n\n\n## Hypothesis tests of the sample mean\n\nLet's go through an extended example about hypothesis testing of a sample mean, sometimes called a **one-sample test**. 
Let's say $X_i$ are feeling thermometer scores about \"liberals\" as a group on a scale of 0 to 100, with values closer to 0 indicating cooler feelings about liberals and values closer to 100 indicating warmer feelings about liberals. We want to know if the population average differs from a neutral value of 50. We can write this two-sided test as\n$$\nH_0: \\mu = 50 \\quad\\text{versus}\\quad H_1: \\mu \\neq 50,\n$$\nwhere $\\mu = \\E[X_i]$. The standard test statistic for this type of test is the so-called **t-statistic**, \n$$ \nT = \\frac{\\left( \\Xbar_n - \\mu_0 \\right)}{\\sqrt{s^2 / n}} =\\frac{\\left( \\Xbar_n - 50 \\right)}{\\sqrt{s^2 / n}},\n$$\nwhere $\\mu_0$ is the null value of interest and $s^2$ is the sample variance. If the null hypothesis is true, then by the CLT, we know that the t-statistic is asymptotically normal, $T \\indist \\N(0, 1)$. Thus, we can approximate the null distribution with the standard normal!\n\nLet's create a test with level $\\alpha = 0.05$. Then we need to find the rejection region that puts $0.05$ probability in the tails of the null distribution, which we just saw was $\\N(0,1)$. Let $\\Phi()$ be the CDF for the standard normal and let $\\Phi^{-1}()$ be the quantile function for the standard normal. Drawing on what we developed above, you can find the value $c$ so that $\\P(|T| > c \\mid \\mu_0)$ is 0.05 with\n$$\nc = \\Phi^{-1}(1 - 0.05/2) \\approx 1.96,\n$$\nwhich means that a test where we reject when $|T| > 1.96$ would have a level of 0.05 asymptotically. \n\n\n## The Wald test\n\nWe can generalize the hypothesis test for the sample mean to estimators more broadly. Let $\\widehat{\\theta}_n$ be an estimator for some parameter $\\theta$ and let $\\widehat{\\textsf{se}}[\\widehat{\\theta}_n]$ be a consistent estimate of the standard error of the estimator, $\\textsf{se}[\\widehat{\\theta}_n] = \\sqrt{\\V[\\widehat{\\theta}_n]}$. 
We consider the two-sided test\n$$\nH_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta \neq \theta_0.\n$$\n\nIn many cases, our estimators will be asymptotically normal by a version of the CLT so that under the null hypothesis, we have\n$$ \nT = \frac{\widehat{\theta}_n - \theta_0}{\widehat{\textsf{se}}[\widehat{\theta}_n]} \indist \N(0, 1). \n$$\nThe **Wald test** rejects $H_0$ when $|T| > z_{\alpha/2}$, where $z_{\alpha/2}$ is the value that puts $\alpha/2$ in the upper tail of the standard normal. That is, if $Z \sim \N(0, 1)$, then $z_{\alpha/2}$ satisfies $\P(Z \geq z_{\alpha/2}) = \alpha/2$. \n\n::: {.callout-note}\n\nIn R, you can find the $z_{\alpha/2}$ values easily with the `qnorm()` function:\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.05 / 2, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.959964\n```\n:::\n:::\n\n\n:::\n\n::: {#thm-wald}\nAsymptotically, the Wald test has size $\alpha$ such that\n$$ \n\P(|T| > z_{\alpha/2} \mid \theta_0) \to \alpha.\n$$\n\n:::\n\nThis result is very general, and it means that many, many hypothesis tests based on estimators will have the same form. The main difference across estimators will be how we calculate the estimated standard error. \n\n::: {#exm-two-props}\n\n## Difference in proportions\n\nIn get-out-the-vote (GOTV) experiments, we might randomly assign a group of citizens to receive mailers encouraging them to vote, whereas a control group receives no message. We'll define the turnout variables in the treatment group $Y_{1}, Y_{2}, \ldots, Y_{n_t}$ as iid draws from a Bernoulli distribution with success probability $p_t$, which represents the population turnout rate among treated citizens. The outcomes in the control group $X_{1}, X_{2}, \ldots, X_{n_c}$ are iid draws from another Bernoulli distribution with success probability $p_c$, which represents the population turnout rate among citizens not receiving a mailer. 
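To make this setup concrete, here is a small simulated version of such an experiment. All of the numbers below (the group sizes and the true turnout rates) are hypothetical, chosen only for illustration:

```r
# Simulate a hypothetical GOTV experiment with known true turnout rates.
set.seed(60637)
n_t <- 500                               # treated group size (received mailers)
n_c <- 500                               # control group size (no message)
y <- rbinom(n_t, size = 1, prob = 0.48)  # treated turnout, assumed p_t = 0.48
x <- rbinom(n_c, size = 1, prob = 0.42)  # control turnout, assumed p_c = 0.42
tau_hat <- mean(y) - mean(x)             # sample difference in proportions
tau_hat
```

With the true difference set to 0.06 here, the estimate will bounce around that value across simulated experiments.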
\n\n\nOur goal is to learn about the treatment effect of this treatment on whether or not the citizen votes, $\\tau = p_t - p_c$, and we will use the sample difference in means/proportions as our estimator, $\\widehat{\\tau} = \\Ybar - \\Xbar$. To perform a Wald test, we need to know/estimate the standard error of this estimator. Notice that because these are independent samples, the variance is\n$$ \n\\V[\\widehat{\\tau}_n] = \\V[\\Ybar - \\Xbar] = \\V[\\Ybar] + \\V[\\Xbar] = \\frac{p_t(1-p_t)}{n_t} + \\frac{p_c(1-p_c)}{n_c},\n$$\nwhere the third equality comes from the fact that the underlying outcome variables $Y_i$ and $X_j$ are binary. Obviously, we do not know the true population proportions $p_t$ and $p_c$ (that's why we're doing the test!), but we can estimate the standard error by replacing them with their estimates\n$$ \n\\widehat{\\textsf{se}}[\\widehat{\\tau}] = \\sqrt{\\frac{\\Ybar(1 -\\Ybar)}{n_t} + \\frac{\\Xbar(1-\\Xbar)}{n_c}}.\n$$\n\nThe typical null hypothesis test, in this case, is \"no treatment effect\" vs. \"some treatment effect\" or\n$$\nH_0: \\tau = p_t - p_c = 0 \\quad\\text{versus}\\quad H_1: \\tau \\neq 0,\n$$\nwhich gives the following test statistic for the Wald test\n$$\nT = \\frac{\\Ybar - \\Xbar}{\\sqrt{\\frac{\\Ybar(1 -\\Ybar)}{n_t} + \\frac{\\Xbar(1-\\Xbar)}{n_c}}}. \n$$\nIf we wanted a test with level $\\alpha = 0.01$, we would reject the null when $|T| > 2.58$ since\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.01/2, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2.575829\n```\n:::\n:::\n\n\n:::\n\n\n::: {#exm-diff-in-means}\n\n## Difference in means\n\nLet's take a similar setting to the last example with randomly assigned treatment and control groups, but now the treatment is an appeal for donations, and the outcomes are continuous measures of how much a person donated to the political campaign. 
Now the treatment data $Y_1, \\ldots, Y_{n_t}$ are iid draws from a population with mean $\\mu_t = \\E[Y_i]$ and population variance $\\sigma^2_t = \\V[Y_i]$. The control data $X_1, \\ldots, X_{n_c}$ are iid draws (independent of the $Y_i$) from a population with mean $\\mu_c = \\E[X_i]$ and population variance $\\sigma^2_c = \\V[X_i]$. The parameter of interest is similar to before: the population difference in means, $\\tau = \\mu_t - \\mu_c$, and we'll form the usual hypothesis test of\n$$ \nH_0: \\tau = \\mu_t - \\mu_c = 0 \\quad\\text{versus}\\quad H_1: \\tau \\neq 0.\n$$\n\nThe only difference between this setting and the difference in proportions is that the standard error will be different because we cannot rely on the Bernoulli variance formula. Instead, we'll use our knowledge of the sampling variance of the sample means and independence between the samples to derive \n$$\n\\V[\\widehat{\\tau}] = \\V[\\Ybar] + \\V[\\Xbar] = \\frac{\\sigma^2_t}{n_t} + \\frac{\\sigma^2_c}{n_c},\n$$\nwhere we can estimate the unknown population variances with the sample variances\n$$\n\\widehat{\\se}[\\widehat{\\tau}] = \\sqrt{\\frac{s^2_t}{n_t} + \\frac{s^2_c}{n_c}}.\n$$\nWe can use this estimator to derive the Wald test statistic of \n$$ \nT = \\frac{\\widehat{\\tau} - 0}{\\widehat{\\se}[\\widehat{\\tau}]} = \\frac{\\Ybar - \\Xbar}{\\sqrt{\\frac{s^2_t}{n_t} + \\frac{s^2_c}{n_c}}},\n$$\nand if we want an asymptotic level of 0.05, we can reject when $|T| > 1.96$.\n:::\n\n\n## p-values\n\nThe hypothesis testing framework focuses on actually making a decision in the face of uncertainty. You choose a level of wrongness you are comfortable with (rate of false positives) and then decide null vs. alternative based firmly on the rejection region. When we're not making a decision, we are somewhat artificially discarding information about the strength of evidence. 
We \"accept\" the null if $T = 1.95$ in the last example but reject it if $T = 1.97$ even though these two situations are actually very similar. Just reporting the reject/retain decision also fails to give us a sense of at what other levels we might have rejected the null. Again, this makes sense if we need to make a single decision: other tests don't matter because we carefully considered our $\\alpha$ level test. But in the lower-stakes world of the academic social sciences, we can afford to be more informative. \n\nOne alternative to reporting the reject/retain decision is to report a **p-value**. \n\n::: {#def-p-value}\n\nThe **p-value** of a test is the probability of observing a test statistic at least as extreme as the observed test statistic in the direction of the alternative hypothesis. \n\n:::\n\nThe line \"in the direction of the alternative hypothesis\" deals with the unfortunate headache of one-sided versus two-sided tests. For a one-sided test where larger values of $T$ correspond to more evidence for $H_1$, the p-value is\n$$\n\\P(T(X_1,\\ldots,X_n) > T \\mid \\theta_0) = 1 - G_0(T),\n$$\nwhereas for a (symmetric) two-sided test, we have\n$$ \n\\P(|T(X_1, \\ldots, X_n)| > |T| \\mid \\theta_0) = 2(1 - G_0(|T|)).\n$$\n\nIn either case, the interpretation of the p-value is the same. It is the smallest size $\\alpha$ at which a test would reject the null. Presenting a p-value allows the reader to choose their own $\\alpha$ level and quickly determine if the evidence would warrant rejecting $H_0$ in that case. Thus, the p-value is a more **continuous** measure of evidence against the null, where lower values are stronger evidence against the null because the observed result is less likely under the null. \n\nThere is a lot of controversy surrounding p-values but most of it focuses on arbitrary p-value cutoffs for determining statistical significance and sometimes publication decisions. 
These problems are not the fault of p-values but rather the hyper-fixation on the reject/retain decision for arbitrary test levels like $\\alpha = 0.05$. It might be best to view p-values as a transformation of the test statistic onto a common scale between 0 and 1. \n\n::: {.callout-warning}\n\nPeople use many statistical shibboleths to purportedly identify those who don't understand statistics, and these shibboleths usually hinge on seemingly subtle differences in interpretation that are easy to miss. If you know the core concepts, the statistical shibboleths tend to be overblown, but it would be malpractice not to flag them for you. \n\nThe shibboleth with p-values is that sometimes people interpret them as \"the probability that the null hypothesis is true.\" Of course, this doesn't make sense from our definition because the p-value *conditions* on the null hypothesis---it cannot tell us anything about the probability of that null hypothesis. Instead, the metaphor you should always carry is that hypothesis tests are statistical thought experiments and that p-values answer the question: how likely would my data be if the null were true? \n\n:::\n\n\n## Power analysis\n\nImagine you have spent a large research budget on a big experiment to test your amazing theory, and the results come back and... you fail to reject the null of no treatment effect. When this happens, there are two possible states of the world: the null is true, and you correctly identified that, or the null is false, but the test lacked the power to detect the true effect. Because of this uncertainty after the fact, it is common for researchers to conduct **power analyses** before running studies that try to forecast what sample size is necessary to ensure you can reject the null under a hypothesized effect size. \n\nGenerally, power analyses involve calculating the power function $\\pi(\\theta) = \\P(T(X_1, \\ldots, X_n) \\in R \\mid \\theta)$ for different values of $\\theta$. 
It might also involve sample size calculations for a particular alternative, $\\theta_1$. In that case, we try to find the sample size $n$ to make the power $\\pi(\\theta_1)$ as close to a particular value (often 0.8) as possible. It is possible to solve for this sample size in simple one-sided tests explicitly. Still, for more general situations or two-sided tests, we typically need numerical or simulation-based approaches to find the optimal sample size. \n\nWith Wald tests, we can characterize the power function quite easily, even if it does not allow us to back out sample size calculations easily. \n\n::: {#thm-power}\nFor a Wald test with an asymptotically normal estimator, the power function for a particular alternative $\\theta_1 \\neq \\theta_0$ is \n$$ \n\\pi(\\theta_1) = 1 - \\Phi\\left( \\frac{\\theta_0 - \\theta_1}{\\widehat{\\se}[\\widehat{\\theta}_n]} + z_{\\alpha/2} \\right) + \\Phi\\left( \\frac{\\theta_0 - \\theta_1}{\\widehat{\\se}[\\widehat{\\theta}_n]}-z_{\\alpha/2} \\right).\n$$\n\n:::\n\n\n## Exact tests under normal data\n\nThe Wald test above relies on large sample approximations. In finite samples, these approximations may not be valid. Can we get **exact** inferences at any sample size? Yes, if we make stronger assumptions about the data. In particular, assume a **parametric model** for the data where $X_1,\\ldots,X_n$ are iid samples from $\\N(\\mu,\\sigma^2)$. Under the null of $H_0: \\mu = \\mu_0$, we can show that \n$$ \nT_n = \\frac{\\Xbar_n - \\mu_0}{s_n/\\sqrt{n}} \\sim t_{n-1},\n$$\nwhere $t_{n-1}$ is the **Student's t-distribution** with $n-1$ degrees of freedom. This result implies the null distribution is $t$, so we use quantiles of $t$ for critical values. For a one-sided test, $c = G^{-1}_0(1 - \\alpha)$, but now $G_0$ is the $t$ distribution with $n-1$ df, and so we use `qt()` instead of `qnorm()` to calculate these critical values. 
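For instance (with a hypothetical sample size of n = 15, chosen only for illustration), we can compare the t and normal critical values for a two-sided test at the 0.05 level:

```r
n <- 15
alpha <- 0.05

# two-sided critical values: qt() for the exact t test, qnorm() for the Wald test
qt(1 - alpha / 2, df = n - 1)  # t critical value, roughly 2.14
qnorm(1 - alpha / 2)           # normal critical value, roughly 1.96

# with a large sample, the t critical value approaches the normal one
qt(1 - alpha / 2, df = 1000)
```

The t critical value is larger, so the exact test demands a bigger test statistic before rejecting than the Wald test does at the same level.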
\n\nThe critical values for the $t$ distribution are always larger than those of the normal because the $t$ has fatter tails, as shown in @fig-shape-of-t. As $n\\to\\infty$, however, the $t$ converges to the standard normal, and so it is asymptotically equivalent to the Wald test but slightly more conservative in finite samples. Oddly, most software packages calculate p-values and rejection regions based on the $t$ to exploit this conservativeness. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![Normal versus t distribution.](04_hypothesis_tests_files/figure-html/fig-shape-of-t-1.png){#fig-shape-of-t width=672}\n:::\n:::\n\n\n\n## Confidence intervals and hypothesis tests\n\n\nAt first glance, we may seem sloppy in using $\\alpha$ in deriving a $1 - \\alpha$ confidence interval in the last chapter and an $\\alpha$-level test in this chapter. In reality, we were foreshadowing the deep connection between the two: every $1-\\alpha$ confidence interval contains all null hypotheses that we **would not reject** with an $\\alpha$-level test. \n\nThis connection is easiest to see with an asymptotically normal estimator, $\\widehat{\\theta}_n$. Consider the hypothesis test of \n$$ \nH_0: \\theta = \\theta_0 \\quad \\text{vs.}\\quad H_1: \\theta \\neq \\theta_0,\n$$\nusing the test statistic,\n$$ \nT = \\frac{\\widehat{\\theta}_{n} - \\theta_{0}}{\\widehat{\\se}[\\widehat{\\theta}_{n}]}. \n$$\nAs we discussed earlier, an $\\alpha = 0.05$ test would reject this null when $|T| > 1.96$, or when \n$$ \n|\\widehat{\\theta}_{n} - \\theta_{0}| > 1.96 \\widehat{\\se}[\\widehat{\\theta}_{n}]. 
\n$$\nNotice that this will be true when \n$$ \n\\theta_{0} < \\widehat{\\theta}_{n} - 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}]\\quad \\text{ or }\\quad \\widehat{\\theta}_{n} + 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}] < \\theta_{0}\n$$\nor, equivalently, when the null hypothesis is outside of the 95% confidence interval, $$\\theta_0 \\notin \\left[\\widehat{\\theta}_{n} - 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}], \\widehat{\\theta}_{n} + 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}]\\right].$$ \nOf course, our choice of the null hypothesis was arbitrary, which means that any null hypothesis outside the 95% confidence interval would be rejected by an $\\alpha = 0.05$ level test of that null. And any null hypothesis inside the confidence interval is a null hypothesis that we would not reject. \n\nThis relationship holds more broadly. Any $1-\\alpha$ confidence interval contains all possible parameter values that would not be rejected as the null hypothesis of an $\\alpha$-level hypothesis test. This connection can be handy for two reasons:\n\n1. We can quickly determine if we would reject a null hypothesis at some level by inspecting if it falls in a confidence interval. \n2. In some situations, determining a confidence interval might be difficult, but performing a hypothesis test is straightforward. Then, we can find the rejection region for the test and determine what null hypotheses would not be rejected at level $\\alpha$ to formulate the $1-\\alpha$ confidence interval. We call this process **inverting a test**. A critical application of this method is for formulating confidence intervals for treatment effects based on randomization inference in the finite population analysis of experiments. \n\n\n", + "markdown": "# Hypothesis tests\n\n\nUp to now, we have discussed the properties of estimators that allow us to characterize their distributions in finite and large samples. 
These properties might let us say that, for example, our estimated difference in means is equal to a true average treatment effect on average across repeated samples or that it will converge to the true value in large samples. These properties, however, are properties of repeated samples. As researchers, we will only have access to a single sample. **Statistical inference** is the process of using our single sample to learn about population parameters. There are several connected ways to conduct inference, but one of the most ubiquitous in the sciences is the hypothesis test, which is a kind of statistical thought experiment. \n\n\n\n\n## The lady tasting tea\n\nThe lady tasting tea, due to R.A. Fisher, exemplifies the core ideas behind hypothesis testing.[^1] Fisher had prepared tea for his colleague, the algologist Muriel Bristol. Knowing that she preferred milk in her tea, he poured milk into a tea cup and then poured the hot tea into the milk. Bristol rejected the cup, stating that she preferred pouring the tea first, then milk. Fisher was skeptical of the idea that anyone could tell the difference between a cup poured milk-first or tea-first. So he and another colleague, William Roach, devised a test to see if Bristol could distinguish the two preparation methods. \n\nFisher and Roach prepared 8 cups of tea, four milk-first and four tea-first. They then presented the cups to Bristol in a random order (though she knew there were 4 of each type), and she proceeded to identify all of the cups correctly. 
At first glance, this seems like good evidence that she can tell the difference between the two types, but a skeptic like Fisher raised the question: \"could she have just been randomly guessing and got lucky?\" This led Fisher to a **statistical thought experiment**: what would the probability of guessing the correct cups be *if* she were guessing randomly?\n\nTo calculate the probability of Bristol's achievement, we can note that \"randomly guessing\" here would mean that she was selecting a group of 4 cups to be labeled milk-first from the 8 cups available. Using basic combinatorics, we can calculate there are 70 ways to choose 4 cups among 8, but only 1 of those arrangements would be correct. Thus, if randomly guessing means choosing among those 70 options with equal chance, then the probability of guessing the right set of cups is 1/70 or $\\approx 0.014$. The low probability implies that the hypothesis of random guessing may be implausible. \n\nThe story of the lady tasting tea encapsulates many of the core elements of hypothesis testing. Hypothesis testing is about taking our observed estimate (Bristol guessing all the cups correctly) and seeing how likely that observed estimate would be under some assumption or hypothesis about the data-generating process (Bristol was randomly guessing). When the observed estimate is unlikely under the maintained hypothesis, we might view this as evidence against that hypothesis. Thus, hypothesis tests help us assess evidence for particular guesses about the DGP. \n\n\n[^1]: The analysis here largely comes from @Senn12. \n\n\n::: {.callout-note}\n\n## Notation alert\n\nFor the rest of this chapter, we'll introduce the concepts following the notation in the past chapters. We'll usually assume that we have a random (iid) sample of random variables $X_1, \\ldots, X_n$ from a distribution, $F$. We'll focus on estimating some parameter, $\\theta$, of this distribution (like the mean, median, variance, etc.). 
We'll refer to $\\Theta$ as the set of possible values of $\\theta$ or the **parameter space**.\n\n:::\n\n## Hypotheses\n\nIn the context of hypothesis testing, hypotheses are just statements about the population distribution. In particular, we will make statements that $\\theta = \\theta_0$ where $\\theta_0 \\in \\Theta$ is the hypothesized value of $\\theta$. Hypotheses are ubiquitous in empirical work, but here are some examples to give you a flavor:\n\n- The population proportion of US citizens that identify as Democrats is 0.33. \n- The population difference in average voter turnout between households who received get-out-the-vote mailers vs. those who did not is 0. \n- The difference in the average incidence of human rights abuse in countries that signed a human rights treaty vs. those countries that did not sign is 0. \n\nEach of these is a statement about the true DGP. The latter two are very common: when $\\theta$ represents the difference in means between two groups, then $\\theta = 0$ is the hypothesis of no actual difference in population means or no treatment effect (if the causal effect is identified). \n\nThe goal of hypothesis testing is to adjudicate between two complementary hypotheses. \n\n::: {#def-null}\n\nThe two hypotheses in a hypothesis test are called the **null hypothesis** and the **alternative hypothesis**, denoted as $H_0$ and $H_1$, respectively. \n\n:::\n\nThese hypotheses are complementary, so if the null hypothesis $H_0: \\theta \\in \\Theta_0$, then the alternative hypothesis is $H_1: \\theta \\in \\Theta_0^c$. The \"null\" in null hypothesis might seem odd until you realize that most null hypotheses are that there is no effect of some treatment or no difference in means. For example, suppose $\\theta$ is the difference in mean support for expanding legal immigration between a treatment group that received a pro-immigrant message and some facts about immigration and a control group that just received the factual information. 
Then, the typical null hypothesis would be no difference in means or $H_0: \\theta = 0$, and the alternative would be $H_1: \\theta \\neq 0$. \n\nThere are two types of tests that differ in the form of their null and alternative hypotheses. A **two-sided test** is of the form\n$$\nH_0: \\theta = \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta \\neq \\theta_0,\n$$\nwhere the \"two-sided\" part refers to how the alternative contains values of $\\theta$ above and below the null value $\\theta_0$. A **one-sided test** has the form\n$$\nH_0: \\theta \\leq \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta > \\theta_0,\n$$\nor\n$$\nH_0: \\theta \\geq \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta < \\theta_0.\n$$\nTwo-sided tests are much more common in the social sciences, where we want to know if there is any evidence, positive or negative, against the presumption of no treatment effect or no relationship between two variables. One-sided tests are for situations where we only want evidence in one direction, which is rarely relevant to social science research. One-sided tests also have the downside of being misused to inflate the strength of evidence against the null and should be avoided. Unfortunately, the math of two-sided tests is also more complicated. \n\n## The procedure of hypothesis testing\n\nAt the most basic level, a **hypothesis test** is a rule that specifies values of the sample data for which we will decide to **reject** the null hypothesis. Let $\\mathcal{X}_n$ be the range of the sample---that is, all possible vectors $(x_1, \\ldots, x_n)$ that have a positive probability of occurring. 
Then, a hypothesis test describes a region of this space, $R \\subset \\mathcal{X}_n$, called the **rejection region** where when $(X_1, \\ldots, X_n) \\in R$ we will **reject** $H_0$ and when the data is outside this region, $(X_1, \\ldots, X_n) \\notin R$ we **retain**, **accept**, or **fail to reject** the null hypothesis.[^2]\n\n[^2]: Different people and different textbooks describe what to do when we do not reject the null hypothesis in different ways. The terminology is not so important so long as you understand that rejecting the null does not mean the null is logically false, and \"accepting\" the null does not mean the null is logically true. \n\n\n\n\nHow do we decide what the rejection region should be? Even though we define the rejection region in terms of the **sample space**, $\\mathcal{X}_n$, it's unwieldy to work with the entire vector of data. Instead, we often formulate the rejection region in terms of a **test statistic**, $T = T(X_1, \\ldots, X_n)$, where the rejection region becomes\n$$\nR = \\left\\{(x_1, \\ldots, x_n) : T(x_1, \\ldots, x_n) > c\\right\\},\n$$\nwhere $c$ is called the **critical value**. This expression says that the rejection region is the part of the sample space that makes the test statistic sufficiently large. We reject null hypotheses when the observed data is incompatible with those hypotheses, where the test statistic should be a measure of this incompatibility. Note that the test statistic is a random variable and has a distribution---we will exploit this to understand the different properties of a hypothesis test. \n\n\n\n::: {#exm-biden}\n\nSuppose that $(X_1, \\ldots, X_n)$ represents a sample of US citizens where $X_i = 1$ indicates support for the current US president and $X_i = 0$ means no support. We might be interested in the test of the null hypothesis that the president does not have the support of a majority of American citizens. Let $\\mu = \\E[X_i] = \\P(X_i = 1)$. 
Then, a one-sided test would compare the two hypotheses:\n$$ \nH_0: \\mu \\leq 0.5 \\quad\\text{versus}\\quad H_1: \\mu > 0.5.\n$$\nIn this case, we might use the sample mean as the test statistic, so that $T(X_1, \\ldots, X_n) = \\Xbar_n$ and we have to find some threshold above 0.5 such that we would reject the null, \n$$ \nR = \\left\\{(x_1, \\ldots, x_n): \\Xbar_n > c\\right\\}.\n$$\nIn words, how much support should we see for the current president before we reject the notion that they lack majority support? Below we will select the critical value, $c$, to have beneficial statistical properties. \n:::\n\nThe structure of a rejection region will depend on whether a test is one- or two-sided. One-sided tests will take the form $T > c$, whereas two-sided tests will take the form $|T| > c$ since we want to count deviations from either side of the null hypothesis as evidence against that null. \n\n## Testing errors\n\nHypothesis tests end with a decision to reject the null hypothesis or not, but this might be an incorrect decision. In particular, there are two ways to make errors and two ways to be correct in this setting, as shown in @tbl-errors. The labels are confusing, but it's helpful to remember that **type I errors** (said \"type one\") are labeled so because they are the worse of the two types of errors. These errors occur when we reject a null (say there is a true treatment effect or relationship) when the null is true (there is no true treatment effect or relationship). Type I errors are what we see in the replication crisis: lots of \"significant\" effects that turn out later to be null. **Type II errors** (said \"type two\") are considered less problematic: there is a true relationship, but we cannot detect it with our test (we cannot reject the null). 
\n\n\n| | $H_0$ True | $H_0$ False |\n|--------------|--------------|---------------|\n| Retain $H_0$ | Awesome | Type II error |\n| Reject $H_0$ | Type I error | Great |\n\n: Typology of testing errors {#tbl-errors}\n\n\nIdeally, we would minimize the chances of making either a type I or type II error. Unfortunately, because the test statistic is a random variable, we cannot remove the probability of an error altogether. Instead, we will derive tests with some guaranteed performance to minimize the probability of type I error. To derive this, we can define the **power function** of a test,\n$$ \n\\pi(\\theta) = \\P\\left( \\text{Reject } H_0 \\mid \\theta \\right) = \\P\\left( T \\in R \\mid \\theta \\right),\n$$\nwhich is the probability of rejection as a function of the parameter of interest, $\\theta$. The power function tells us, for example, how likely we are to reject the null of no treatment effect as we vary the actual size of the treatment effect. \n\nWe can define the probability of type I error from the power function. \n\n::: {#def-size}\nThe **size** of a hypothesis test with the null hypothesis $H_0: \\theta = \\theta_0$ is \n$$ \n\\pi(\\theta_0) = \\P\\left( \\text{Reject } H_0 \\mid \\theta_0 \\right).\n$$\n:::\n\nYou can think of the size of a test as the rate of false positives (or false discoveries) produced by the test. @fig-size-power shows an example of rejection regions, size, and power for a one-sided test. In the left panel, we have the distribution of the test statistic under the null, with $H_0: \\theta = \\theta_0$, and the rejection region is defined by values $T > c$. The shaded grey region is the probability of rejection under this null hypothesis or the size of the test. 
Sometimes, we will get extreme samples by random chance, even under the null, leading to false discoveries.[^3]\n\n[^3]: Eagle-eyed readers will notice that the null tested here is a point, while we previously defined the null in a one-sided test as a region $H_0: \\theta \\leq \\theta_0$. Technically, the size of the test will vary based on which of these nulls we pick. In this example, notice that any null to the left of $\\theta_0$ will result in a lower size. And so, the null at the boundary, $\\theta_0$, will maximize the size of the test, making it the most \"conservative\" null to investigate. Technically, we should define the size of a test as $\\alpha = \\sup_{\\theta \\in \\Theta_0} \\pi(\\theta)$. \n\nIn the right panel, we overlay the distribution of the test statistic under one particular alternative, $\\theta = \\theta_1 > \\theta_0$. The red-shaded region is the probability of rejecting the null when this alternative is true, or the power---it's the probability of correctly rejecting the null when it is false. Intuitively, we can see that alternatives that produce test statistics closer to the rejection region will have higher power. This makes sense: detecting big deviations from the null should be easier than detecting minor ones. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Size of a test and power against an alternative.](04_hypothesis_tests_files/figure-html/fig-size-power-1.png){#fig-size-power width=672}\n:::\n:::\n\n\n\n@fig-size-power also hints at a tradeoff between size and power. Notice that we could make the size smaller (lower the false positive rate) by increasing the critical value to $c' > c$. This would make the probability of being in the rejection region smaller, $\\P(T > c' \\mid \\theta_0) < \\P(T > c \\mid \\theta_0)$, leading to a test with a smaller size. 
Unfortunately, it would also reduce power in the right panel since the probability of being in the rejection region will be lower under any alternative, $\\P(T > c' \\mid \\theta_1) < \\P(T > c \\mid \\theta_1)$. This means we usually cannot simultaneously reduce both types of errors. \n\n## Determining the rejection region\n\n\nIf we cannot simultaneously optimize a test's size and power, how should we determine where the rejection region is? That is, how should we decide what empirical evidence will be strong enough for us to reject the null? The standard approach to this problem in hypothesis testing is to control the size of a test (that is, control the rate of false positives) and try to maximize the power of the test subject to that constraint. So we say, \"I'm willing to accept that at most x% of findings will be false positives,\" and we do whatever we can to maximize power subject to that constraint. \n\n::: {#def-level}\n\nA test has **significance level** $\\alpha$ if its size is less than or equal to $\\alpha$, or $\\pi(\\theta_0) \\leq \\alpha$.\n\n:::\n\nA test with a significance level of $\\alpha = 0.05$ will have a false positive/type I error rate no larger than 0.05. This level is widespread in the social sciences, though you also will see $\\alpha = 0.01$ or $\\alpha = 0.1$. Frequentists justify this by saying that with $\\alpha = 0.05$, at most 5% of studies will produce false discoveries. \n\nOur task is to construct the rejection region so that the **null distribution** of the test statistic $G_0(t) = \\P(T \\leq t \\mid \\theta_0)$ has less than $\\alpha$ probability in that region. One-sided tests like in @fig-size-power are the easiest to show, even though we warned you not to use them. 
We want to choose $c$ that puts no more than $\\alpha$ probability in the tail, or\n$$ \n\\P(T > c \\mid \\theta_0) = 1 - G_0(c) \\leq \\alpha.\n$$\nRemember that smaller values of $c$ yield higher power, which implies that the critical value giving maximum power while maintaining the significance level is the one where $1 - G_0(c) = \\alpha$. We can use the **quantile function** of the null distribution to find the exact value of $c$ we need,\n$$\nc = G^{-1}_0(1 - \\alpha),\n$$\nwhich is just fancy math to say, \"the value below which $1-\\alpha$ of the null distribution lies.\"\n\nThe determination of the rejection region follows the same principles for two-sided tests, but it is slightly more complicated because we reject when the magnitude of the test statistic is large, $|T| > c$. @fig-two-sided shows the basic setup. Notice that because there are two (disjoint) regions, we can write the size (false positive rate) as\n$$ \n\\pi(\\theta_0) = G_0(-c) + 1 - G_0(c).\n$$\nIn most cases that we will see, the null distribution for such a test will be symmetric around 0 (usually asymptotically standard normal, actually), which means that $G_0(-c) = 1 - G_0(c)$, which implies that the size is\n$$ \n\\pi(\\theta_0) = 2(1 - G_0(c)).\n$$ \nSolving for the critical value that would make this $\\alpha$ gives\n$$ \nc = G^{-1}_0(1 - \\alpha/2).\n$$\nAgain, this formula can seem dense, but remember what you are doing: finding the value that puts $\\alpha/2$ of the probability of the null distribution in each tail. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Rejection regions for a two-sided test.](04_hypothesis_tests_files/figure-html/fig-two-sided-1.png){#fig-two-sided width=672}\n:::\n:::\n\n\n\n## Hypothesis tests of the sample mean\n\nLet's go through an extended example about hypothesis testing of a sample mean, sometimes called a **one-sample test**. 
Let's say $X_i$ are feeling thermometer scores about \"liberals\" as a group on a scale of 0 to 100, with values closer to 0 indicating cooler feelings about liberals and values closer to 100 indicating warmer feelings about liberals. We want to know if the population average differs from a neutral value of 50. We can write this two-sided test as\n$$\nH_0: \\mu = 50 \\quad\\text{versus}\\quad H_1: \\mu \\neq 50,\n$$\nwhere $\\mu = \\E[X_i]$. The standard test statistic for this type of test is the so-called **t-statistic**, \n$$ \nT = \\frac{\\left( \\Xbar_n - \\mu_0 \\right)}{\\sqrt{s^2 / n}} =\\frac{\\left( \\Xbar_n - 50 \\right)}{\\sqrt{s^2 / n}},\n$$\nwhere $\\mu_0$ is the null value of interest and $s^2$ is the sample variance. If the null hypothesis is true, then by the CLT, we know that the t-statistic is asymptotically normal, $T \\indist \\N(0, 1)$. Thus, we can approximate the null distribution with the standard normal!\n\nLet's create a test with level $\\alpha = 0.05$. Then we need to find the rejection region that puts $0.05$ probability in the tails of the null distribution, which we just saw was $\\N(0,1)$. Let $\\Phi()$ be the CDF for the standard normal and let $\\Phi^{-1}()$ be the quantile function for the standard normal. Drawing on what we developed above, you can find the value $c$ so that $\\P(|T| > c \\mid \\mu_0)$ is 0.05 with\n$$\nc = \\Phi^{-1}(1 - 0.05/2) \\approx 1.96,\n$$\nwhich means that a test where we reject when $|T| > 1.96$ would have a level of 0.05 asymptotically. \n\n\n## The Wald test\n\nWe can generalize the hypothesis test for the sample mean to estimators more broadly. Let $\\widehat{\\theta}_n$ be an estimator for some parameter $\\theta$ and let $\\widehat{\\textsf{se}}[\\widehat{\\theta}_n]$ be a consistent estimate of the standard error of the estimator, $\\textsf{se}[\\widehat{\\theta}_n] = \\sqrt{\\V[\\widehat{\\theta}_n]}$. 
We consider the two-sided test\n$$\nH_0: \\theta = \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta \\neq \\theta_0.\n$$\n\nIn many cases, our estimators will be asymptotically normal by a version of the CLT so that under the null hypothesis, we have\n$$ \nT = \\frac{\\widehat{\\theta}_n - \\theta_0}{\\widehat{\\textsf{se}}[\\widehat{\\theta}_n]} \\indist \\N(0, 1). \n$$\nThe **Wald test** rejects $H_0$ when $|T| > z_{\\alpha/2}$, where $z_{\\alpha/2}$ is the value that puts $\\alpha/2$ in the upper tail of the standard normal. That is, if $Z \\sim \\N(0, 1)$, then $z_{\\alpha/2}$ satisfies $\\P(Z \\geq z_{\\alpha/2}) = \\alpha/2$. \n\n::: {.callout-note}\n\nIn R, you can find the $z_{\\alpha/2}$ values easily with the `qnorm()` function:\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.05 / 2, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.959964\n```\n:::\n:::\n\n\n:::\n\n::: {#thm-wald}\nAsymptotically, the Wald test has size $\\alpha$ such that\n$$ \n\\P(|T| > z_{\\alpha/2} \\mid \\theta_0) \\to \\alpha.\n$$\n\n:::\n\nThis result is very general, and it means that many, many hypothesis tests based on estimators will have the same form. The main difference across estimators will be how we calculate the estimated standard error. \n\n::: {#exm-two-props}\n\n## Difference in proportions\n\nIn get-out-the-vote (GOTV) experiments, we might randomly assign a group of citizens to receive mailers encouraging them to vote, whereas a control group receives no message. We'll define the turnout variables in the treatment group $Y_{1}, Y_{2}, \\ldots, Y_{n_t}$ as iid draws from a Bernoulli distribution with success probability $p_t$, which represents the population turnout rate among treated citizens. The outcomes in the control group $X_{1}, X_{2}, \\ldots, X_{n_c}$ are iid draws from another Bernoulli distribution with success probability $p_c$, which represents the population turnout rate among citizens not receiving a mailer. 
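As a quick sketch, data from this kind of design can be simulated in R; the turnout rates below are hypothetical, chosen only for illustration:

```r
## Hypothetical population turnout rates, for illustration only
set.seed(2139)
n_t <- 400   # treated group size
n_c <- 400   # control group size
y <- rbinom(n_t, size = 1, prob = 0.38)  # turnout among treated
x <- rbinom(n_c, size = 1, prob = 0.31)  # turnout among control

## Sample turnout proportions in each group
c(treated = mean(y), control = mean(x))
```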
\n\n\nOur goal is to learn about the effect of this treatment on whether or not the citizen votes, $\\tau = p_t - p_c$, and we will use the sample difference in means/proportions as our estimator, $\\widehat{\\tau} = \\Ybar - \\Xbar$. To perform a Wald test, we need to know/estimate the standard error of this estimator. Notice that because these are independent samples, the variance is\n$$ \n\\V[\\widehat{\\tau}_n] = \\V[\\Ybar - \\Xbar] = \\V[\\Ybar] + \\V[\\Xbar] = \\frac{p_t(1-p_t)}{n_t} + \\frac{p_c(1-p_c)}{n_c},\n$$\nwhere the third equality comes from the fact that the underlying outcome variables $Y_i$ and $X_j$ are binary. Obviously, we do not know the true population proportions $p_t$ and $p_c$ (that's why we're doing the test!), but we can estimate the standard error by replacing them with their estimates\n$$ \n\\widehat{\\textsf{se}}[\\widehat{\\tau}] = \\sqrt{\\frac{\\Ybar(1 -\\Ybar)}{n_t} + \\frac{\\Xbar(1-\\Xbar)}{n_c}}.\n$$\n\nThe typical null hypothesis test, in this case, is \"no treatment effect\" vs. \"some treatment effect\" or\n$$\nH_0: \\tau = p_t - p_c = 0 \\quad\\text{versus}\\quad H_1: \\tau \\neq 0,\n$$\nwhich gives the following test statistic for the Wald test:\n$$\nT = \\frac{\\Ybar - \\Xbar}{\\sqrt{\\frac{\\Ybar(1 -\\Ybar)}{n_t} + \\frac{\\Xbar(1-\\Xbar)}{n_c}}}. \n$$\nIf we wanted a test with level $\\alpha = 0.01$, we would reject the null when $|T| > 2.58$ since\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.01/2, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2.575829\n```\n:::\n:::\n\n\n:::\n\n\n::: {#exm-diff-in-means}\n\n## Difference in means\n\nLet's take a similar setting to the last example with randomly assigned treatment and control groups, but now the treatment is an appeal for donations, and the outcomes are continuous measures of how much a person donated to the political campaign. 
Now the treatment data $Y_1, \\ldots, Y_{n_t}$ are iid draws from a population with mean $\\mu_t = \\E[Y_i]$ and population variance $\\sigma^2_t = \\V[Y_i]$. The control data $X_1, \\ldots, X_{n_c}$ are iid draws (independent of the $Y_i$) from a population with mean $\\mu_c = \\E[X_i]$ and population variance $\\sigma^2_c = \\V[X_i]$. The parameter of interest is similar to before: the population difference in means, $\\tau = \\mu_t - \\mu_c$, and we'll form the usual hypothesis test of\n$$ \nH_0: \\tau = \\mu_t - \\mu_c = 0 \\quad\\text{versus}\\quad H_1: \\tau \\neq 0.\n$$\n\nThe only difference between this setting and the difference in proportions is that the standard error will be different because we cannot rely on the Bernoulli variance formula. Instead, we'll use our knowledge of the sampling variance of the sample means and independence between the samples to derive \n$$\n\\V[\\widehat{\\tau}] = \\V[\\Ybar] + \\V[\\Xbar] = \\frac{\\sigma^2_t}{n_t} + \\frac{\\sigma^2_c}{n_c},\n$$\nwhere we can estimate the unknown population variances with the sample variances\n$$\n\\widehat{\\se}[\\widehat{\\tau}] = \\sqrt{\\frac{s^2_t}{n_t} + \\frac{s^2_c}{n_c}}.\n$$\nWe can use this estimator to derive the Wald test statistic of \n$$ \nT = \\frac{\\widehat{\\tau} - 0}{\\widehat{\\se}[\\widehat{\\tau}]} = \\frac{\\Ybar - \\Xbar}{\\sqrt{\\frac{s^2_t}{n_t} + \\frac{s^2_c}{n_c}}},\n$$\nand if we want an asymptotic level of 0.05, we can reject when $|T| > 1.96$.\n:::\n\n\n## p-values\n\nThe hypothesis testing framework focuses on actually making a decision in the face of uncertainty. You choose a level of wrongness you are comfortable with (rate of false positives) and then decide null vs. alternative based firmly on the rejection region. When we're not making a decision, we are somewhat artificially discarding information about the strength of evidence. 
We \"accept\" the null if $T = 1.95$ in the last example but reject it if $T = 1.97$ even though these two situations are actually very similar. Just reporting the reject/retain decision also fails to give us a sense of at what other levels we might have rejected the null. Again, this makes sense if we need to make a single decision: other tests don't matter because we carefully considered our $\\alpha$ level test. But in the lower-stakes world of the academic social sciences, we can afford to be more informative. \n\nOne alternative to reporting the reject/retain decision is to report a **p-value**. \n\n::: {#def-p-value}\n\nThe **p-value** of a test is the probability of observing a test statistic at least as extreme as the observed test statistic in the direction of the alternative hypothesis. \n\n:::\n\nThe line \"in the direction of the alternative hypothesis\" deals with the unfortunate headache of one-sided versus two-sided tests. For a one-sided test where larger values of $T$ correspond to more evidence for $H_1$, the p-value is\n$$\n\\P(T(X_1,\\ldots,X_n) > T \\mid \\theta_0) = 1 - G_0(T),\n$$\nwhereas for a (symmetric) two-sided test, we have\n$$ \n\\P(|T(X_1, \\ldots, X_n)| > |T| \\mid \\theta_0) = 2(1 - G_0(|T|)).\n$$\n\nIn either case, the interpretation of the p-value is the same. It is the smallest size $\\alpha$ at which a test would reject the null. Presenting a p-value allows readers to choose their own $\\alpha$ level and quickly determine whether the evidence would warrant rejecting $H_0$ in that case. Thus, the p-value is a more **continuous** measure of evidence against the null, where lower values are stronger evidence against the null because the observed result is less likely under the null. \n\nThere is a lot of controversy surrounding p-values, but most of it focuses on arbitrary p-value cutoffs for determining statistical significance and sometimes publication decisions. 
These problems are not the fault of p-values but rather the hyperfixation on the reject/retain decision for arbitrary test levels like $\\alpha = 0.05$. It might be best to view p-values as a transformation of the test statistic onto a common scale between 0 and 1. \n\n::: {.callout-warning}\n\nPeople use many statistical shibboleths to purportedly identify those who don't understand statistics, and these shibboleths usually hinge on seemingly subtle differences in interpretation that are easy to miss. If you know the core concepts, the statistical shibboleths tend to be overblown, but it would be malpractice not to flag them for you. \n\nThe shibboleth with p-values is that sometimes people interpret them as \"the probability that the null hypothesis is true.\" Of course, this doesn't make sense from our definition because the p-value *conditions* on the null hypothesis---it cannot tell us anything about the probability of that null hypothesis. Instead, the metaphor you should always carry is that hypothesis tests are statistical thought experiments and that p-values answer the question: how likely would my data be if the null were true? \n\n:::\n\n\n## Power analysis\n\nImagine you have spent a large research budget on a big experiment to test your amazing theory, and the results come back and... you fail to reject the null of no treatment effect. When this happens, there are two possible states of the world: the null is true, and you correctly identified that, or the null is false but the test had too little power to detect the true effect. Because of this uncertainty after the fact, it is common for researchers to conduct **power analyses** before running studies that try to forecast what sample size is necessary to ensure you can reject the null under a hypothesized effect size. \n\nGenerally, power analyses involve calculating the power function $\\pi(\\theta) = \\P(T(X_1, \\ldots, X_n) \\in R \\mid \\theta)$ for different values of $\\theta$. 
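One way to do this calculation is by simulation: draw many samples under a given alternative and record how often the test rejects. A sketch for the one-sample test of $H_0: \mu = 0$, with a hypothetical effect size and sample size:

```r
## Monte Carlo approximation of the power function at one alternative
set.seed(12345)
power_sim <- function(mu1, n, alpha = 0.05, sims = 2000) {
  rejections <- replicate(sims, {
    x <- rnorm(n, mean = mu1, sd = 1)     # data drawn under the alternative
    t_stat <- mean(x) / sqrt(var(x) / n)  # test statistic for H0: mu = 0
    abs(t_stat) > qnorm(1 - alpha / 2)    # did the test reject?
  })
  mean(rejections)  # share of simulated samples in the rejection region
}

## Power at the hypothetical alternative mu = 0.2 with n = 100
power_sim(mu1 = 0.2, n = 100)
```

For these illustrative values, the rejection rate lands near one half, a reminder that modest effects with modest samples are detected far less often than researchers tend to assume.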
It might also involve sample size calculations for a particular alternative, $\\theta_1$. In that case, we try to find the sample size $n$ to make the power $\\pi(\\theta_1)$ as close to a particular value (often 0.8) as possible. It is possible to solve for this sample size explicitly in simple one-sided tests, but for more general situations or two-sided tests, we typically need numerical or simulation-based approaches to find the optimal sample size. \n\nWith Wald tests, we can characterize the power function quite easily, even if it does not allow us to back out sample size calculations directly. \n\n::: {#thm-power}\nFor a Wald test with an asymptotically normal estimator, the power function for a particular alternative $\\theta_1 \\neq \\theta_0$ is \n$$ \n\\pi(\\theta_1) = 1 - \\Phi\\left( \\frac{\\theta_0 - \\theta_1}{\\widehat{\\se}[\\widehat{\\theta}_n]} + z_{\\alpha/2} \\right) + \\Phi\\left( \\frac{\\theta_0 - \\theta_1}{\\widehat{\\se}[\\widehat{\\theta}_n]}-z_{\\alpha/2} \\right).\n$$\n\n:::\n\n\n## Exact tests under normal data\n\nThe Wald test above relies on large sample approximations. In finite samples, these approximations may not be valid. Can we get **exact** inferences at any sample size? Yes, if we make stronger assumptions about the data. In particular, assume a **parametric model** for the data where $X_1,\\ldots,X_n$ are iid samples from $N(\\mu,\\sigma^2)$. Under a null of $H_0: \\mu = \\mu_0$, we can show that \n$$ \nT_n = \\frac{\\Xbar_n - \\mu_0}{s_n/\\sqrt{n}} \\sim t_{n-1},\n$$\nwhere $t_{n-1}$ is the **Student's t-distribution** with $n-1$ degrees of freedom. This result implies the null distribution is $t$, so we use quantiles of $t$ for critical values. For a one-sided test, $c = G^{-1}_0(1 - \\alpha)$, but now $G_0$ is $t$ with $n-1$ df and so we use `qt()` instead of `qnorm()` to calculate these critical values. 
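For example, with $\alpha = 0.05$ and $n = 10$ (nine degrees of freedom), the two-sided $t$ critical value is noticeably larger than its normal counterpart:

```r
## Two-sided critical values at alpha = 0.05
qt(1 - 0.05 / 2, df = 9)  # t critical value with n - 1 = 9 df: about 2.262
qnorm(1 - 0.05 / 2)       # standard normal critical value: about 1.960
```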
\n\nThe critical values for the $t$ distribution are always larger than the normal because the t has fatter tails, as shown in @fig-shape-of-t. As $n\\to\\infty$, however, the $t$ converges to the standard normal, and so it is asymptotically equivalent to the Wald test but slightly more conservative in finite samples. Oddly, most software packages calculate p-values and rejection regions based on the $t$ to exploit this conservativeness. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![Normal versus t distribution.](04_hypothesis_tests_files/figure-html/fig-shape-of-t-1.png){#fig-shape-of-t width=672}\n:::\n:::\n\n\n\n## Confidence intervals and hypothesis tests\n\n\nAt first glance, we may seem sloppy in using $\\alpha$ in deriving a $1 - \\alpha$ confidence interval in the last chapter and an $\\alpha$-level test in this chapter. In reality, we were foreshadowing the deep connection between the two: every $1-\\alpha$ confidence interval contains all null hypotheses that we **would not reject** with an $\\alpha$-level test. \n\nThis connection is easiest to see with an asymptotically normal estimator, $\\widehat{\\theta}_n$. Consider the hypothesis test of \n$$ \nH_0: \\theta = \\theta_0 \\quad \\text{vs.}\\quad H_1: \\theta \\neq \\theta_0,\n$$\nusing the test statistic,\n$$ \nT = \\frac{\\widehat{\\theta}_{n} - \\theta_{0}}{\\widehat{\\se}[\\widehat{\\theta}_{n}]}. \n$$\nAs we discussed earlier, an $\\alpha = 0.05$ test would reject this null when $|T| > 1.96$, or when \n$$ \n|\\widehat{\\theta}_{n} - \\theta_{0}| > 1.96 \\widehat{\\se}[\\widehat{\\theta}_{n}]. 
\n$$\nNotice that this will be true when \n$$ \n\\theta_{0} < \\widehat{\\theta}_{n} - 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}]\\quad \\text{ or }\\quad \\widehat{\\theta}_{n} + 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}] < \\theta_{0}\n$$\nor, equivalently, when the null hypothesis is outside of the 95% confidence interval, $$\\theta_0 \\notin \\left[\\widehat{\\theta}_{n} - 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}], \\widehat{\\theta}_{n} + 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}]\\right].$$ \nOf course, our choice of the null hypothesis was arbitrary, which means that any null hypothesis outside the 95% confidence interval would be rejected by an $\\alpha = 0.05$ level test of that null. And any null hypothesis inside the confidence interval is a null hypothesis that we would not reject. \n\nThis relationship holds more broadly. Any $1-\\alpha$ confidence interval contains all possible parameter values that would not be rejected as the null hypothesis of an $\\alpha$-level hypothesis test. This connection can be handy for two reasons:\n\n1. We can quickly determine if we would reject a null hypothesis at some level by inspecting if it falls in a confidence interval. \n2. In some situations, determining a confidence interval might be difficult, but performing a hypothesis test is straightforward. Then, we can find the rejection region for the test and determine what null hypotheses would not be rejected at level $\\alpha$ to formulate the $1-\\alpha$ confidence interval. We call this process **inverting a test**. A critical application of this method is for formulating confidence intervals for treatment effects based on randomization inference in the finite population analysis of experiments. 
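This duality is easy to check numerically. A small sketch with a hypothetical estimate and standard error:

```r
## Hypothetical estimate and standard error, for illustration only
est <- 0.25
se  <- 0.10

## 95% confidence interval: est +/- 1.96 * se, here [0.054, 0.446]
ci <- est + c(-1, 1) * 1.96 * se

## alpha = 0.05 Wald test of H0: theta = theta_0
wald_rejects <- function(theta_0) abs((est - theta_0) / se) > 1.96

wald_rejects(0.10)  # inside the interval: FALSE (not rejected)
wald_rejects(0.50)  # outside the interval: TRUE (rejected)
```

Any null inside the interval yields $|T| \leq 1.96$ and is retained; any null outside yields $|T| > 1.96$ and is rejected, exactly as the algebra above implies.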
\n\n\n", "supporting": [ "04_hypothesis_tests_files/figure-html" ], diff --git a/_freeze/04_hypothesis_tests/execute-results/tex.json b/_freeze/04_hypothesis_tests/execute-results/tex.json index 82e65e9..e57bf9c 100644 --- a/_freeze/04_hypothesis_tests/execute-results/tex.json +++ b/_freeze/04_hypothesis_tests/execute-results/tex.json @@ -1,7 +1,7 @@ { - "hash": "55f1ee4b011ed2824a645cec73fb7784", + "hash": "dccb64733a8ec7b56cb5e180514f556f", "result": { - "markdown": "# Hypothesis tests\n\n\nUp to now, we have discussed the properties of estimators that allow us to characterize their distributions in finite and large samples. These properties might let us say that, for example, our estimated difference in means is equal to a true average treatment effect on average across repeated samples or that it will converge to the true value in large samples. These properties, however, are properties of repeated samples. As researchers, we will only have access to a single sample. **Statistical inference** is the process of using our single sample to learn about population parameters. Several ways to conduct inference are connected, but one of the most ubiquitous in the sciences is the hypothesis test, which is a kind of statistical thought experiment. \n\n\n\n\n## The lady tasting tea\n\nThe lady tasting tea exemplifies the core ideas behind hypothesis testing due to R.A. Fisher.[^1] Fisher had prepared tea for his colleague, the algologist Muriel Bristol. Knowing that she preferred milk in her tea, he poured milk into a tea cup and then poured the hot tea into the milk. Bristol rejected the cup, stating that she preferred pouring the tea first, then milk. Fisher was skeptical at the idea anyone could tell the difference between a cup poured milk-first or tea-first. So he and another colleague, William Roach, devised a test to see if Bristol could distinguish the two preparation methods. \n\nFisher and Roach prepared 8 cups of tea, four milk-first and four tea-first. 
They then presented the cups to Bristol in a random order (though she knew there were 4 of each type), and she proceeded to identify all of the cups correctly. At first glance, this seems like good evidence that she can tell the difference between the two types, but a skeptic like Fisher raised the question: \"could she have just been randomly guessing and got lucky?\" This led Fisher to a **statistical thought experiment**: what would the probability of guessing the correct cups be *if* she were guessing randomly?\n\nTo calculate the probability of Bristol's achievement, we can note that \"randomly guessing\" here would mean that she was selecting a group of 4 cups to be labeled milk-first from the 8 cups available. Using basic combinatorics, we can calculate there are 70 ways to choose 4 cups among 8, but only 1 of those arrangements would be correct. Thus, if randomly guessing means choosing among those 70 options with equal chance, then the probability of guessing the right set of cups is 1/70 or $\\approx 0.014$. The low probability implies that the hypothesis of random guessing may be implausible. \n\nThe story of the lady tasting tea encapsulates many of the core elements of hypothesis testing. Hypothesis testing is about taking our observed estimate (Bristol guessing all the cups correctly) and seeing how likely that observed estimate would be under some assumption or hypothesis about the data-generating process (Bristol was randomly guessing). When the observed estimate is unlikely under the maintained hypothesis, we might view this as evidence against that hypothesis. Thus, hypothesis tests help us assess evidence for particular guesses about the DGP. \n\n\n[^1]: The analysis here largely comes from @Senn12. \n\n\n::: {.callout-note}\n\n## Notation alert\n\nFor the rest of this chapter, we'll introduce the concepts following the notation in the past chapters. 
We'll usually assume that we have a random (iid) sample of random variables $X_1, \\ldots, X_n$ from a distribution, $F$. We'll focus on estimating some parameter, $\\theta$, of this distribution (like the mean, median, variance, etc.). We'll refer to $\\Theta$ as the set of possible values of $\\theta$ or the **parameter space**.\n\n:::\n\n## Hypotheses\n\nIn the context of hypothesis testing, hypotheses are just statements about the population distribution. In particular, we will make statements that $\\theta = \\theta_0$ where $\\theta_0 \\in \\Theta$ is the hypothesized value of $\\theta$. Hypotheses are ubiquitous in empirical work, but here are some examples to give you a flavor:\n\n- The population proportion of US citizens that identify as Democrats is 0.33. \n- The population difference in average voter turnout between households who received get-out-the-vote mailers vs. those who did not is 0. \n- The difference in the average incidence of human rights abuse in countries that signed a human rights treaty vs. those countries that did not sign is 0. \n\nEach of these is a statement about the true DGP. The latter two are very common: when $\\theta$ represents the difference in means between two groups, then $\\theta = 0$ is the hypothesis of no actual difference in population means or no treatment effect (if the causal effect is identified). \n\nThe goal of hypothesis testing is to adjudicate between two complementary hypotheses. \n\n::: {#def-null}\n\nThe two hypotheses in a hypothesis test are called the **null hypothesis** and the **alternative hypothesis**, denoted as $H_0$ and $H_1$, respectively. \n\n:::\n\nThese hypotheses are complementary, so if the null hypothesis is $H_0: \\theta \\in \\Theta_0$, then the alternative hypothesis is $H_1: \\theta \\in \\Theta_0^c$. The \"null\" in null hypothesis might seem odd until you realize that most null hypotheses are that there is no effect of some treatment or no difference in means. 
For example, suppose $\\theta$ is the difference in mean support for expanding legal immigration between a treatment group that received a pro-immigrant message and some facts about immigration and a control group that just received the factual information. Then, the typical null hypothesis would be no difference in means or $H_0: \\theta = 0$, and the alternative would be $H_1: \\theta \\neq 0$. \n\nThere are two types of tests that differ in the form of their null and alternative hypotheses. A **two-sided test** is of the form\n$$\nH_0: \\theta = \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta \\neq \\theta_0,\n$$\nwhere the \"two-sided\" part refers to how the alternative contains values of $\\theta$ above and below the null value $\\theta_0$. A **one-sided test** has the form\n$$\nH_0: \\theta \\leq \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta > \\theta_0,\n$$\nor\n$$\nH_0: \\theta \\geq \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta < \\theta_0.\n$$\nTwo-sided tests are much more common in the social sciences, where we want to know if there is any evidence, positive or negative, against the presumption of no treatment effect or no relationship between two variables. One-sided tests are for situations where we only want evidence in one direction, which is rarely relevant to social science research. One-sided tests also have the downside of being misused to inflate the strength of evidence against the null and should be avoided. Unfortunately, the math of two-sided tests is also more complicated. \n\n## The procedure of hypothesis testing\n\nAt the most basic level, a **hypothesis test** is a rule that specifies values of the sample data for which we will decide to **reject** the null hypothesis. Let $\\mathcal{X}_n$ be the range of the sample---that is, all possible vectors $(x_1, \\ldots, x_n)$ that have positive probability of occurring. 
Then, a hypothesis test describes a region of this space, $R \\subset \\mathcal{X}_n$, called the **rejection region**, such that when $(X_1, \\ldots, X_n) \\in R$ we will **reject** $H_0$ and when the data is outside this region, $(X_1, \\ldots, X_n) \\notin R$ we **retain**, **accept**, or **fail to reject** the null hypothesis.[^2]\n\n[^2]: Different people and different textbooks describe what to do when we do not reject the null hypothesis in different ways. The terminology is not so important so long as you understand that rejecting the null does not mean the null is logically false, and \"accepting\" the null does not mean the null is logically true. \n\n\n\n\nHow do we decide what the rejection region should be? Even though we define the rejection region in terms of the **sample space**, $\\mathcal{X}_n$, it's unwieldy to work with the entire vector of data. Instead, we often formulate the rejection region in terms of a **test statistic**, $T = T(X_1, \\ldots, X_n)$, where the rejection region becomes\n$$\nR = \\left\\{(x_1, \\ldots, x_n) : T(x_1, \\ldots, x_n) > c\\right\\},\n$$\nwhere $c$ is called the **critical value**. This expression says that the rejection region is the part of the sample space that makes the test statistic sufficiently large. We reject null hypotheses when the observed data is incompatible with those hypotheses, where the test statistic should be a measure of this incompatibility. Note that the test statistic is a random variable and has a distribution---we will exploit this to understand the different properties of a hypothesis test. \n\n\n\n::: {#exm-biden}\n\nSuppose that $(X_1, \\ldots, X_n)$ represents a sample of US citizens where $X_i = 1$ indicates support for the current US president and $X_i = 0$ means no support. We might be interested in the test of the null hypothesis that the president does not have the support of a majority of American citizens. Let $\\mu = \\E[X_i] = \\P(X_i = 1)$. 
Then, a one-sided test would compare the two hypotheses:\n$$ \nH_0: \\mu \\leq 0.5 \\quad\\text{versus}\\quad H_1: \\mu > 0.5.\n$$\nIn this case, we might use the sample mean as the test statistic, so that $T(X_1, \\ldots, X_n) = \\Xbar_n$ and we have to find some threshold above 0.5 such that we would reject the null, \n$$ \nR = \\left\\{(x_1, \\ldots, x_n): \\Xbar_n > c\\right\\}.\n$$\nIn words, how much support should we see for the current president before we reject the notion that they lack majority support? Below we will select the critical value, $c$, to have beneficial statistical properties. \n:::\n\nThe structure of a rejection region will depend on whether a test is one- or two-sided. One-sided tests will take the form $T > c$, whereas two-sided tests will take the form $|T| > c$ since we want to count deviations from either side of the null hypothesis as evidence against that null. \n\n## Testing errors\n\nHypothesis tests end with a decision to reject the null hypothesis or not, but this might be an incorrect decision. In particular, there are two ways to make errors and two ways to be correct in this setting, as shown in @tbl-errors. The labels are confusing, but it's helpful to remember that **type I errors** (said \"type one\") are labeled so because they are the worse of the two types of errors. These errors occur when we reject a null (say there is a true treatment effect or relationship) when the null is true (there is no true treatment effect or relationship). Type I errors are what we see in the replication crisis: lots of \"significant\" effects that turn out later to be null. **Type II errors** (said \"type two\") are considered less problematic: there is a true relationship, but we cannot detect it with our test (we cannot reject the null). 
\n\n\n| | $H_0$ True | $H_0$ False |\n|--------------|--------------|---------------|\n| Retain $H_0$ | Awesome | Type II error |\n| Reject $H_0$ | Type I error | Great |\n\n: Typology of testing errors {#tbl-errors}\n\n\nIdeally, we would minimize the chances of making either a type I or type II error. Unfortunately, because the test statistic is a random variable, we cannot remove the probability of an error altogether. Instead, we will derive tests with some guaranteed performance to minimize the probability of type I error. To derive this, we can define the **power function** of a test,\n$$ \n\\pi(\\theta) = \\P\\left( \\text{Reject } H_0 \\mid \\theta \\right) = \\P\\left( T \\in R \\mid \\theta \\right),\n$$\nwhich is the probability of rejection as a function of the parameter of interest, $\\theta$. The power function tells us, for example, how likely we are to reject the null of no treatment effect as we vary the actual size of the treatment effect. \n\nWe can define the probability of type I error from the power function. \n\n::: {#def-size}\nThe **size** of a hypothesis test with the null hypothesis $H_0: \\theta = \\theta_0$ is \n$$ \n\\pi(\\theta_0) = \\P\\left( \\text{Reject } H_0 \\mid \\theta_0 \\right).\n$$\n:::\n\nYou can think of the size of a test as the rate of false positives (or false discoveries) produced by the test. @fig-size-power shows an example of rejection regions, size, and power for a one-sided test. In the left panel, we have the distribution of the test statistic under the null, with $H_0: \\theta = \\theta_0$, and the rejection region is defined by values $T > c$. The shaded grey region is the probability of rejection under this null hypothesis or the size of the test. 
Sometimes, we will get extreme samples by random chance, even under the null, leading to false discoveries.[^3]\n\n[^3]: Eagle-eyed readers will notice that the null tested here is a point, while we previously defined the null in a one-sided test as a region $H_0: \\theta \\leq \\theta_0$. Technically, the size of the test will vary based on which of these nulls we pick. In this example, notice that any null to the left of $\\theta_0$ will result in a lower size. And so, the null at the boundary, $\\theta_0$, will maximize the size of the test, making it the most \"conservative\" null to investigate. Technically, we should define the size of a test as $\\alpha = \\sup_{\\theta \\in \\Theta_0} \\pi(\\theta)$. \n\nIn the right panel, we overlay the distribution of the test statistic under one particular alternative, $\\theta = \\theta_1 > \\theta_0$. The red-shaded region is the probability of rejecting the null when this alternative is true or the power---it's the probability of correctly rejecting the null when it is false. Intuitively, we can see that alternatives that produce test statistics closer to the rejection region will have higher power. This makes sense: detecting big deviations from the null should be easier than detecting minor ones. \n\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Size of a test and power against an alternative.](04_hypothesis_tests_files/figure-pdf/fig-size-power-1.pdf){#fig-size-power}\n:::\n:::\n\n\n\n\n@fig-size-power also hints at a tradeoff between size and power. Notice that we could make the size smaller (lower the false positive rate) by increasing the critical value to $c' > c$. This would make the probability of being in the rejection region smaller, $\\P(T > c' \\mid \\theta_0) < \\P(T > c \\mid \\theta_0)$, leading to a lower-sized test. 
Unfortunately, it would also reduce power in the right panel since the probability of being in the rejection region will be lower under any alternative, $\\P(T > c' \\mid \\theta_1) < \\P(T > c \\mid \\theta_1)$. This means we usually cannot simultaneously reduce both types of errors. \n\n## Determining the rejection region\n\n\nIf we cannot simultaneously optimize a test's size and power, how should we determine where the rejection region is? That is, how should we decide what empirical evidence will be strong enough for us to reject the null? The standard approach to this problem in hypothesis testing is to control the size of a test (that is, control the rate of false positives) and try to maximize the power of the test subject to that constraint. So we say, \"I'm willing to accept that at most x% of findings will be false positives\" and do whatever we can to maximize power subject to that constraint. \n\n::: {#def-level}\n\nA test has **significance level** $\\alpha$ if its size is less than or equal to $\\alpha$, or $\\pi(\\theta_0) \\leq \\alpha$.\n\n:::\n\nA test with a significance level of $\\alpha = 0.05$ will have a false positive/type I error rate no larger than 0.05. This level is widespread in the social sciences, though you also will see $\\alpha = 0.01$ or $\\alpha = 0.1$. Frequentists justify this by saying that with $\\alpha = 0.05$, at most 5% of studies will produce false discoveries. \n\nOur task is to construct the rejection region so that the **null distribution** of the test statistic $G_0(t) = \\P(T \\leq t \\mid \\theta_0)$ has less than $\\alpha$ probability in that region. One-sided tests like in @fig-size-power are the easiest to show, even though we warned you not to use them. 
We want to choose $c$ that puts no more than $\\alpha$ probability in the tail, or\n$$ \n\\P(T > c \\mid \\theta_0) = 1 - G_0(c) \\leq \\alpha.\n$$\nRemember that smaller values of $c$ maximize power, which implies that the critical value that maximizes power while maintaining the significance level is the one where $1 - G_0(c) = \\alpha$. We can use the **quantile function** of the null distribution to find the exact value of $c$ we need,\n$$\nc = G^{-1}_0(1 - \\alpha),\n$$\nwhich is just fancy math for \"the value below which $1-\\alpha$ of the null distribution falls.\"\n\nThe determination of the rejection region follows the same principles for two-sided tests, but it is slightly more complicated because we reject when the magnitude of the test statistic is large, $|T| > c$. @fig-two-sided shows the basic setup. Notice that because there are two (disjoint) regions, we can write the size (false positive rate) as\n$$ \n\\pi(\\theta_0) = G_0(-c) + 1 - G_0(c).\n$$\nIn most cases that we will see, the null distribution for such a test will be symmetric around 0 (usually asymptotically standard normal, actually), which means that $G_0(-c) = 1 - G_0(c)$, which implies that the size is\n$$ \n\\pi(\\theta_0) = 2(1 - G_0(c)).\n$$ \nSolving for the critical value that would make this $\\alpha$ gives\n$$ \nc = G^{-1}_0(1 - \\alpha/2).\n$$\nAgain, this formula can seem dense, but remember what you are doing: finding the value that puts $\\alpha/2$ of the probability of the null distribution in each tail. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Rejection regions for a two-sided test.](04_hypothesis_tests_files/figure-pdf/fig-two-sided-1.pdf){#fig-two-sided}\n:::\n:::\n\n\n\n\n## Hypothesis tests of the sample mean\n\nLet's go through an extended example about hypothesis testing of a sample mean, sometimes called a **one-sample test**. 
Let's say $X_i$ are feeling thermometer scores about \"liberals\" as a group on a scale of 0 to 100, with values closer to 0 indicating cooler feelings about liberals and values closer to 100 indicating warmer feelings about liberals. We want to know if the population average differs from a neutral value of 50. We can write this two-sided test as\n$$\nH_0: \\mu = 50 \\quad\\text{versus}\\quad H_1: \\mu \\neq 50,\n$$\nwhere $\\mu = \\E[X_i]$. The standard test statistic for this type of test is the so-called **t-statistic**, \n$$ \nT = \\frac{\\left( \\Xbar_n - \\mu_0 \\right)}{\\sqrt{s^2 / n}} =\\frac{\\left( \\Xbar_n - 50 \\right)}{\\sqrt{s^2 / n}},\n$$\nwhere $\\mu_0$ is the null value of interest and $s^2$ is the sample variance. If the null hypothesis is true, then by the CLT, we know that the t-statistic is asymptotically normal, $T \\indist \\N(0, 1)$. Thus, we can approximate the null distribution with the standard normal!\n\nLet's create a test with level $\\alpha = 0.05$. Then we need to find the rejection region that puts $0.05$ probability in the tails of the null distribution, which we just saw was $\\N(0,1)$. Let $\\Phi()$ be the CDF for the standard normal and let $\\Phi^{-1}()$ be the quantile function for the standard normal. Drawing on what we developed above, you can find the value $c$ so that $\\P(|T| > c \\mid \\mu_0)$ is 0.05 with\n$$\nc = \\Phi^{-1}(1 - 0.05/2) \\approx 1.96,\n$$\nwhich means that a test where we reject when $|T| > 1.96$ would have a level of 0.05 asymptotically. \n\n\n## The Wald test\n\nWe can generalize the hypothesis test for the sample mean to estimators more broadly. Let $\\widehat{\\theta}_n$ be an estimator for some parameter $\\theta$ and let $\\widehat{\\textsf{se}}[\\widehat{\\theta}_n]$ be a consistent estimate of the standard error of the estimator, $\\textsf{se}[\\widehat{\\theta}_n] = \\sqrt{\\V[\\widehat{\\theta}_n]}$. 
We consider the two-sided test\n$$\nH_0: \\theta = \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta \\neq \\theta_0.\n$$\n\nIn many cases, our estimators will be asymptotically normal by a version of the CLT so that under the null hypothesis, we have\n$$ \nT = \\frac{\\widehat{\\theta}_n - \\theta_0}{\\widehat{\\textsf{se}}[\\widehat{\\theta}_n]} \\indist \\N(0, 1). \n$$\nThe **Wald test** rejects $H_0$ when $|T| > z_{\\alpha/2}$, where $z_{\\alpha/2}$ is the value that puts $\\alpha/2$ in the upper tail of the standard normal. That is, if $Z \\sim \\N(0, 1)$, then $z_{\\alpha/2}$ satisfies $\\P(Z \\geq z_{\\alpha/2}) = \\alpha/2$. \n\n::: {.callout-note}\n\nIn R, you can find the $z_{\\alpha/2}$ values easily with the `qnorm()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.05 / 2, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.959964\n```\n:::\n:::\n\n\n\n:::\n\n::: {#thm-wald}\nAsymptotically, the Wald test has size $\\alpha$ such that\n$$ \n\\P(|T| > z_{\\alpha/2} \\mid \\theta_0) \\to \\alpha.\n$$\n\n:::\n\nThis result is very general, and it means that many, many hypothesis tests based on estimators will have the same form. The main difference across estimators will be how we calculate the estimated standard error. \n\n::: {#exm-two-props}\n\n## Difference in proportions\n\nIn get-out-the-vote (GOTV) experiments, we might randomly assign a group of citizens to receive mailers encouraging them to vote, whereas a control group receives no message. We'll define the turnout variables in the treatment group $Y_{1}, Y_{2}, \\ldots, Y_{n_t}$ as iid draws from a Bernoulli distribution with success probability $p_t$, which represents the population turnout rate among treated citizens. The outcomes in the control group $X_{1}, X_{2}, \\ldots, X_{n_c}$ are iid draws from another Bernoulli distribution with success probability $p_c$, which represents the population turnout rate among citizens not receiving a mailer. 
\n\n\nOur goal is to learn about the effect of this treatment on whether or not the citizen votes, $\\tau = p_t - p_c$, and we will use the sample difference in means/proportions as our estimator, $\\widehat{\\tau} = \\Ybar - \\Xbar$. To perform a Wald test, we need to know/estimate the standard error of this estimator. Notice that because these are independent samples, the variance is\n$$ \n\\V[\\widehat{\\tau}_n] = \\V[\\Ybar - \\Xbar] = \\V[\\Ybar] + \\V[\\Xbar] = \\frac{p_t(1-p_t)}{n_t} + \\frac{p_c(1-p_c)}{n_c},\n$$\nwhere the third equality comes from the fact that the underlying outcome variables $Y_i$ and $X_j$ are binary. Obviously, we do not know the true population proportions $p_t$ and $p_c$ (that's why we're doing the test!), but we can estimate the standard error by replacing them with their estimates\n$$ \n\\widehat{\\textsf{se}}[\\widehat{\\tau}] = \\sqrt{\\frac{\\Ybar(1 -\\Ybar)}{n_t} + \\frac{\\Xbar(1-\\Xbar)}{n_c}}.\n$$\n\nThe typical null hypothesis test, in this case, is \"no treatment effect\" vs. \"some treatment effect\" or\n$$\nH_0: \\tau = p_t - p_c = 0 \\quad\\text{versus}\\quad H_1: \\tau \\neq 0,\n$$\nwhich gives the following test statistic for the Wald test\n$$\nT = \\frac{\\Ybar - \\Xbar}{\\sqrt{\\frac{\\Ybar(1 -\\Ybar)}{n_t} + \\frac{\\Xbar(1-\\Xbar)}{n_c}}}. \n$$\nIf we wanted a test with level $\\alpha = 0.01$, we would reject the null when $|T| > 2.58$ since\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.01/2, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2.575829\n```\n:::\n:::\n\n\n\n:::\n\n\n::: {#exm-diff-in-means}\n\n## Difference in means\n\nLet's take a similar setting to the last example with randomly assigned treatment and control groups, but now the treatment is an appeal for donations, and the outcomes are continuous measures of how much a person donated to the political campaign. 
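Before turning to continuous outcomes, note that the difference-in-proportions test from the previous example is easy to compute directly. Here is a minimal sketch with made-up turnout numbers (all values hypothetical):

```r
## Wald test for a difference in proportions (hypothetical GOTV data)
n_t <- 500; n_c <- 500   # group sizes
y_bar <- 0.38            # turnout rate in the mailer group
x_bar <- 0.31            # turnout rate in the control group

se_hat <- sqrt(y_bar * (1 - y_bar) / n_t + x_bar * (1 - x_bar) / n_c)
t_stat <- (y_bar - x_bar) / se_hat
abs(t_stat) > qnorm(0.01 / 2, lower.tail = FALSE)  # reject at the 0.01 level?
```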
Now the treatment data $Y_1, \\ldots, Y_{n_t}$ are iid draws from a population with mean $\\mu_t = \\E[Y_i]$ and population variance $\\sigma^2_t = \\V[Y_i]$. The control data $X_1, \\ldots, X_{n_c}$ are iid draws (independent of the $Y_i$) from a population with mean $\\mu_c = \\E[X_i]$ and population variance $\\sigma^2_c = \\V[X_i]$. The parameter of interest is similar to before: the population difference in means, $\\tau = \\mu_t - \\mu_c$, and we'll form the usual hypothesis test of\n$$ \nH_0: \\tau = \\mu_t - \\mu_c = 0 \\quad\\text{versus}\\quad H_1: \\tau \\neq 0.\n$$\n\nThe only difference between this setting and the difference in proportions is that the standard error will be different because we cannot rely on the Bernoulli variance formula. Instead, we'll use our knowledge of the sampling variance of the sample means and independence between the samples to derive \n$$\n\\V[\\widehat{\\tau}] = \\V[\\Ybar] + \\V[\\Xbar] = \\frac{\\sigma^2_t}{n_t} + \\frac{\\sigma^2_c}{n_c},\n$$\nwhere we can estimate the unknown population variances with the sample variances\n$$\n\\widehat{\\se}[\\widehat{\\tau}] = \\sqrt{\\frac{s^2_t}{n_t} + \\frac{s^2_c}{n_c}}.\n$$\nWe can use this estimator to derive the Wald test statistic of \n$$ \nT = \\frac{\\widehat{\\tau} - 0}{\\widehat{\\se}[\\widehat{\\tau}]} = \\frac{\\Ybar - \\Xbar}{\\sqrt{\\frac{s^2_t}{n_t} + \\frac{s^2_c}{n_c}}},\n$$\nand if we want an asymptotic level of 0.05, we can reject when $|T| > 1.96$.\n:::\n\n\n## p-values\n\nThe hypothesis testing framework focuses on actually making a decision in the face of uncertainty. You choose a level of wrongness you are comfortable with (rate of false positives) and then decide null vs. alternative based firmly on the rejection region. When we're not making a decision, we are somewhat artificially discarding information about the strength of evidence. 
We \"accept\" the null if $T = 1.95$ in the last example but reject it if $T = 1.97$ even though these two situations are actually very similar. Just reporting the reject/retain decision also fails to give us a sense of at what other levels we might have rejected the null. Again, this makes sense if we need to make a single decision: other tests don't matter because we carefully considered our $\\alpha$ level test. But in the lower-stakes world of the academic social sciences, we can afford to be more informative. \n\nOne alternative to reporting the reject/retain decision is to report a **p-value**. \n\n::: {#def-p-value}\n\nThe **p-value** of a test is the probability of observing a test statistic is at least as extreme as the observed test statistic in the direction of the alternative hypothesis. \n\n:::\n\nThe line \"in the direction of the alternative hypothesis\" deals with the unfortunate headache of one-sided versus two-sided tests. For a one-sided test where larger values of $T$ correspond to more evidence for $H_1$, the p-value is\n$$\n\\P(T(X_1,\\ldots,X_n) > T \\mid \\theta_0) = 1 - G_0(T),\n$$\nwhereas for a (symmetric) two-sided test, we have\n$$ \n\\P(|T(X_1, \\ldots, X_n)| > |T| \\mid \\theta_0) = 2(1 - G_0(|T|)).\n$$\n\nIn either case, the interpretation of the p-value is the same. It is the smallest size $\\alpha$ at which a test would reject null. Presenting a p-value allows the reader to determine their own $\\alpha$ level and determine quickly if the evidence would warrant rejecting $H_0$ in that case. Thus, the p-value is a more **continuous** measure of evidence against the null, where lower values are stronger evidence against the null because the observed result is less likely under the null. \n\nThere is a lot of controversy surrounding p-values but most of it focuses on arbitrary p-value cutoffs for determining statistical significance and sometimes publication decisions. 
These problems are not the fault of p-values but rather the hyper-fixation on the reject/retain decision for arbitrary test levels like $\\alpha = 0.05$. It might be best to view p-values as a transformation of the test statistic onto a common scale between 0 and 1. \n\n::: {.callout-warning}\n\nPeople use many statistical shibboleths to purportedly identify those who don't understand statistics, and these shibboleths usually hinge on seemingly subtle differences in interpretation that are easy to miss. If you know the core concepts, the statistical shibboleths tend to be overblown, but it would be malpractice not to flag them for you. \n\nThe shibboleth with p-values is that sometimes people interpret them as \"the probability that the null hypothesis is true.\" Of course, this doesn't make sense from our definition because the p-value *conditions* on the null hypothesis---it cannot tell us anything about the probability of that null hypothesis. Instead, the metaphor you should always carry is that hypothesis tests are statistical thought experiments and that p-values answer the question: how likely would my data be if the null were true? \n\n:::\n\n\n## Power analysis\n\nImagine you have spent a large research budget on a big experiment to test your amazing theory, and the results come back and... you fail to reject the null of no treatment effect. When this happens, there are two possible states of the world: the null is true, and you correctly identified that, or the null is false, but the test lacked the power to detect the true effect. Because of this uncertainty after the fact, it is common for researchers to conduct **power analyses** before running studies, forecasting what sample size is necessary to ensure you can reject the null under a hypothesized effect size. \n\nGenerally, power analyses involve calculating the power function $\\pi(\\theta) = \\P(T(X_1, \\ldots, X_n) \\in R \\mid \\theta)$ for different values of $\\theta$. 
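One way to calculate the power function is by simulation: generate data under a particular alternative, apply the test, and record the rejection rate. Here is a minimal sketch for the two-sided one-sample test with simulated normal data (the effect size, sample size, and seed are all hypothetical choices):

```r
## Simulated power of a two-sided one-sample test with alpha = 0.05
set.seed(1234)
power_sim <- function(theta_1, n, n_sims = 5000) {
  rejects <- replicate(n_sims, {
    x <- rnorm(n, mean = theta_1)          # data drawn under the alternative
    t_stat <- mean(x) / (sd(x) / sqrt(n))  # test H_0: mu = 0
    abs(t_stat) > qnorm(0.975)             # two-sided rejection rule
  })
  mean(rejects)  # estimated power against theta_1
}
power_sim(theta_1 = 0.25, n = 100)  # bigger effects or samples raise power
```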
It might also involve sample size calculations for a particular alternative, $\\theta_1$. In that case, we try to find the sample size $n$ to make the power $\\pi(\\theta_1)$ as close to a particular value (often 0.8) as possible. It is possible to solve for this sample size explicitly in simple one-sided tests, but for more general situations or two-sided tests, we typically need numerical or simulation-based approaches to find the optimal sample size. \n\nWith Wald tests, we can characterize the power function quite easily, even if it does not allow us to back out sample size calculations easily. \n\n::: {#thm-power}\nFor a Wald test with an asymptotically normal estimator, the power function for a particular alternative $\\theta_1 \\neq \\theta_0$ is \n$$ \n\\pi(\\theta_1) = 1 - \\Phi\\left( \\frac{\\theta_0 - \\theta_1}{\\widehat{\\se}[\\widehat{\\theta}_n]} + z_{\\alpha/2} \\right) + \\Phi\\left( \\frac{\\theta_0 - \\theta_1}{\\widehat{\\se}[\\widehat{\\theta}_n]}-z_{\\alpha/2} \\right).\n$$\n\n:::\n\n\n## Exact tests under normal data\n\nThe Wald test above relies on large sample approximations. In finite samples, these approximations may not be valid. Can we get **exact** inferences at any sample size? Yes, if we make stronger assumptions about the data. In particular, assume a **parametric model** for the data where $X_1,\\ldots,X_n$ are i.i.d. samples from $N(\\mu,\\sigma^2)$. Under the null of $H_0: \\mu = \\mu_0$, we can show that \n$$ \nT_n = \\frac{\\Xbar_n - \\mu_0}{s_n/\\sqrt{n}} \\sim t_{n-1},\n$$\nwhere $t_{n-1}$ is the **Student's t-distribution** with $n-1$ degrees of freedom. This result implies the null distribution is $t$, so we use quantiles of $t$ for critical values. For a one-sided test, $c = G^{-1}_0(1 - \\alpha)$, but now $G_0$ is the $t$ distribution with $n-1$ degrees of freedom, so we use `qt()` instead of `qnorm()` to calculate these critical values. 
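For example, with $n = 15$ (14 degrees of freedom) and a one-sided test at $\alpha = 0.05$, the $t$ critical value is somewhat larger than its normal counterpart:

```r
qt(0.95, df = 14)  # t critical value, about 1.76
qnorm(0.95)        # normal critical value, about 1.64
```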
\n\nThe critical values for the $t$ distribution are always larger than the normal because the $t$ has fatter tails, as shown in @fig-shape-of-t. As $n\\to\\infty$, however, the $t$ converges to the standard normal, and so it is asymptotically equivalent to the Wald test but slightly more conservative in finite samples. Oddly, most software packages calculate p-values and rejection regions based on the $t$ to exploit this conservativeness. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Normal versus t distribution.](04_hypothesis_tests_files/figure-pdf/fig-shape-of-t-1.pdf){#fig-shape-of-t}\n:::\n:::\n\n\n\n\n## Confidence intervals and hypothesis tests\n\n\nAt first glance, we may seem sloppy in using $\\alpha$ in deriving a $1 - \\alpha$ confidence interval in the last chapter and an $\\alpha$-level test in this chapter. In reality, we were foreshadowing the deep connection between the two: every $1-\\alpha$ confidence interval contains all null hypotheses that we **would not reject** with an $\\alpha$-level test. \n\nThis connection is easiest to see with an asymptotically normal estimator, $\\widehat{\\theta}_n$. Consider the hypothesis test of \n$$ \nH_0: \\theta = \\theta_0 \\quad \\text{vs.}\\quad H_1: \\theta \\neq \\theta_0,\n$$\nusing the test statistic,\n$$ \nT = \\frac{\\widehat{\\theta}_{n} - \\theta_{0}}{\\widehat{\\se}[\\widehat{\\theta}_{n}]}. \n$$\nAs we discussed earlier, an $\\alpha = 0.05$ test would reject this null when $|T| > 1.96$, or when \n$$ \n|\\widehat{\\theta}_{n} - \\theta_{0}| > 1.96 \\widehat{\\se}[\\widehat{\\theta}_{n}]. 
\n$$\nNotice that this will be true when \n$$ \n\\theta_{0} < \\widehat{\\theta}_{n} - 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}]\\quad \\text{ or }\\quad \\widehat{\\theta}_{n} + 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}] < \\theta_{0}\n$$\nor, equivalently, that the null hypothesis is outside of the 95% confidence interval, $$\\theta_0 \\notin \\left[\\widehat{\\theta}_{n} - 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}], \\widehat{\\theta}_{n} + 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}]\\right].$$ \nOf course, our choice of the null hypothesis was arbitrary, which means that any null hypothesis outside the 95% confidence interval would be rejected by an $\\alpha = 0.05$ level test of that null. And any null hypothesis inside the confidence interval is a null hypothesis that we would not reject. \n\nThis relationship holds more broadly. Any $1-\\alpha$ confidence interval contains all possible parameter values that would not be rejected as the null hypothesis of an $\\alpha$-level hypothesis test. This connection can be handy for two reasons:\n\n1. We can quickly determine if we would reject a null hypothesis at some level by inspecting if it falls in a confidence interval. \n2. In some situations, determining a confidence interval might be difficult, but performing a hypothesis test is straightforward. Then, we can find the rejection region for the test and determine what null hypotheses would not be rejected at level $\\alpha$ to formulate the $1-\\alpha$ confidence interval. We call this process **inverting a test**. A critical application of this method is for formulating confidence intervals for treatment effects based on randomization inference in the finite population analysis of experiments. \n\n\n", + "markdown": "# Hypothesis tests\n\n\nUp to now, we have discussed the properties of estimators that allow us to characterize their distributions in finite and large samples. 
These properties might let us say that, for example, our estimated difference in means is equal to a true average treatment effect on average across repeated samples or that it will converge to the true value in large samples. These properties, however, are properties of repeated samples. As researchers, we will only have access to a single sample. **Statistical inference** is the process of using our single sample to learn about population parameters. There are several connected ways to conduct inference, but one of the most ubiquitous in the sciences is the hypothesis test, which is a kind of statistical thought experiment. \n\n\n\n\n## The lady tasting tea\n\nThe story of the lady tasting tea, due to R.A. Fisher, exemplifies the core ideas behind hypothesis testing.[^1] Fisher had prepared tea for his colleague, the algologist Muriel Bristol. Knowing that she preferred milk in her tea, he poured milk into a tea cup and then poured the hot tea into the milk. Bristol rejected the cup, stating that she preferred pouring the tea first, then milk. Fisher was skeptical of the idea that anyone could tell the difference between a cup poured milk-first and one poured tea-first. So he and another colleague, William Roach, devised a test to see if Bristol could distinguish the two preparation methods. \n\nFisher and Roach prepared 8 cups of tea, four milk-first and four tea-first. They then presented the cups to Bristol in a random order (though she knew there were 4 of each type), and she proceeded to identify all of the cups correctly. 
At first glance, this seems like good evidence that she can tell the difference between the two types, but a skeptic like Fisher raised the question: \"could she have just been randomly guessing and got lucky?\" This led Fisher to a **statistical thought experiment**: what would the probability of guessing the correct cups be *if* she were guessing randomly?\n\nTo calculate the probability of Bristol's achievement, we can note that \"randomly guessing\" here would mean that she was selecting a group of 4 cups to be labeled milk-first from the 8 cups available. Using basic combinatorics, we can calculate there are 70 ways to choose 4 cups among 8, but only 1 of those arrangements would be correct. Thus, if randomly guessing means choosing among those 70 options with equal chance, then the probability of guessing the right set of cups is 1/70 or $\\approx 0.014$. The low probability implies that the hypothesis of random guessing may be implausible. \n\nThe story of the lady tasting tea encapsulates many of the core elements of hypothesis testing. Hypothesis testing is about taking our observed estimate (Bristol guessing all the cups correctly) and seeing how likely that observed estimate would be under some assumption or hypothesis about the data-generating process (Bristol was randomly guessing). When the observed estimate is unlikely under the maintained hypothesis, we might view this as evidence against that hypothesis. Thus, hypothesis tests help us assess evidence for particular guesses about the DGP. \n\n\n[^1]: The analysis here largely comes from @Senn12. \n\n\n::: {.callout-note}\n\n## Notation alert\n\nFor the rest of this chapter, we'll introduce the concepts following the notation in the past chapters. We'll usually assume that we have a random (iid) sample of random variables $X_1, \\ldots, X_n$ from a distribution, $F$. We'll focus on estimating some parameter, $\\theta$, of this distribution (like the mean, median, variance, etc.). 
We'll refer to $\\Theta$ as the set of possible values of $\\theta$ or the **parameter space**.\n\n:::\n\n## Hypotheses\n\nIn the context of hypothesis testing, hypotheses are just statements about the population distribution. In particular, we will make statements that $\\theta = \\theta_0$ where $\\theta_0 \\in \\Theta$ is the hypothesized value of $\\theta$. Hypotheses are ubiquitous in empirical work, but here are some examples to give you a flavor:\n\n- The population proportion of US citizens that identify as Democrats is 0.33. \n- The population difference in average voter turnout between households who received get-out-the-vote mailers vs. those who did not is 0. \n- The difference in the average incidence of human rights abuse in countries that signed a human rights treaty vs. those countries that did not sign is 0. \n\nEach of these is a statement about the true DGP. The latter two are very common: when $\\theta$ represents the difference in means between two groups, then $\\theta = 0$ is the hypothesis of no actual difference in population means or no treatment effect (if the causal effect is identified). \n\nThe goal of hypothesis testing is to adjudicate between two complementary hypotheses. \n\n::: {#def-null}\n\nThe two hypotheses in a hypothesis test are called the **null hypothesis** and the **alternative hypothesis**, denoted as $H_0$ and $H_1$, respectively. \n\n:::\n\nThese hypotheses are complementary, so if the null hypothesis $H_0: \\theta \\in \\Theta_0$, then the alternative hypothesis is $H_1: \\theta \\in \\Theta_0^c$. The \"null\" in null hypothesis might seem odd until you realize that most null hypotheses are that there is no effect of some treatment or no difference in means. For example, suppose $\\theta$ is the difference in mean support for expanding legal immigration between a treatment group that received a pro-immigrant message and some facts about immigration and a control group that just received the factual information. 
Then, the typical null hypothesis would be no difference in means or $H_0: \\theta = 0$, and the alternative would be $H_1: \\theta \\neq 0$. \n\nThere are two types of tests that differ in the form of their null and alternative hypotheses. A **two-sided test** is of the form\n$$\nH_0: \\theta = \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta \\neq \\theta_0,\n$$\nwhere the \"two-sided\" part refers to how the alternative contains values of $\\theta$ above and below the null value $\\theta_0$. A **one-sided test** has the form\n$$\nH_0: \\theta \\leq \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta > \\theta_0,\n$$\nor\n$$\nH_0: \\theta \\geq \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta < \\theta_0.\n$$\nTwo-sided tests are much more common in the social sciences, where we want to know if there is any evidence, positive or negative, against the presumption of no treatment effect or no relationship between two variables. One-sided tests are for situations where we only want evidence in one direction, which is rarely relevant to social science research. One-sided tests also have the downside of being misused to inflate the strength of evidence against the null and should be avoided. Unfortunately, the math of two-sided tests is also more complicated. \n\n## The procedure of hypothesis testing\n\nAt the most basic level, a **hypothesis test** is a rule that specifies values of the sample data for which we will decide to **reject** the null hypothesis. Let $\\mathcal{X}_n$ be the range of the sample---that is, all possible vectors $(x_1, \\ldots, x_n)$ that have a positive probability of occurring. 
Then, a hypothesis test describes a region of this space, $R \\subset \\mathcal{X}_n$, called the **rejection region** where when $(X_1, \\ldots, X_n) \\in R$ we will **reject** $H_0$ and when the data is outside this region, $(X_1, \\ldots, X_n) \\notin R$ we **retain**, **accept**, or **fail to reject** the null hypothesis.[^2]\n\n[^2]: Different people and different textbooks describe what to do when we do not reject the null hypothesis in different ways. The terminology is not so important so long as you understand that rejecting the null does not mean the null is logically false, and \"accepting\" the null does not mean the null is logically true. \n\n\n\n\nHow do we decide what the rejection region should be? Even though we define the rejection region in terms of the **sample space**, $\\mathcal{X}_n$, it's unwieldy to work with the entire vector of data. Instead, we often formulate the rejection region in terms of a **test statistic**, $T = T(X_1, \\ldots, X_n)$, where the rejection region becomes\n$$\nR = \\left\\{(x_1, \\ldots, x_n) : T(x_1, \\ldots, x_n) > c\\right\\},\n$$\nwhere $c$ is called the **critical value**. This expression says that the rejection region is the part of the sample space that makes the test statistic sufficiently large. We reject null hypotheses when the observed data is incompatible with those hypotheses, where the test statistic should be a measure of this incompatibility. Note that the test statistic is a random variable and has a distribution---we will exploit this to understand the different properties of a hypothesis test. \n\n\n\n::: {#exm-biden}\n\nSuppose that $(X_1, \\ldots, X_n)$ represents a sample of US citizens where $X_i = 1$ indicates support for the current US president and $X_i = 0$ means no support. We might be interested in the test of the null hypothesis that the president does not have the support of a majority of American citizens. Let $\\mu = \\E[X_i] = \\P(X_i = 1)$. 
Then, a one-sided test would compare the two hypotheses:\n$$ \nH_0: \\mu \\leq 0.5 \\quad\\text{versus}\\quad H_1: \\mu > 0.5.\n$$\nIn this case, we might use the sample mean as the test statistic, so that $T(X_1, \\ldots, X_n) = \\Xbar_n$ and we have to find some threshold above 0.5 such that we would reject the null, \n$$ \nR = \\left\\{(x_1, \\ldots, x_n): \\Xbar_n > c\\right\\}.\n$$\nIn words, how much support should we see for the current president before we reject the notion that they lack majority support? Below we will select the critical value, $c$, to have beneficial statistical properties. \n:::\n\nThe structure of a rejection region will depend on whether a test is one- or two-sided. One-sided tests will take the form $T > c$, whereas two-sided tests will take the form $|T| > c$ since we want to count deviations from either side of the null hypothesis as evidence against that null. \n\n## Testing errors\n\nHypothesis tests end with a decision to reject the null hypothesis or not, but this might be an incorrect decision. In particular, there are two ways to make errors and two ways to be correct in this setting, as shown in @tbl-errors. The labels are confusing, but it's helpful to remember that **type I errors** (said \"type one\") are labeled so because they are the worse of the two types of errors. These errors occur when we reject a null (say there is a true treatment effect or relationship) when the null is true (there is no true treatment effect or relationship). Type I errors are what we see in the replication crisis: lots of \"significant\" effects that turn out later to be null. **Type II errors** (said \"type two\") are considered less problematic: there is a true relationship, but we cannot detect it with our test (we cannot reject the null). 
\n\n\n| | $H_0$ True | $H_0$ False |\n|--------------|--------------|---------------|\n| Retain $H_0$ | Awesome | Type II error |\n| Reject $H_0$ | Type I error | Great |\n\n: Typology of testing errors {#tbl-errors}\n\n\nIdeally, we would minimize the chances of making either a type I or type II error. Unfortunately, because the test statistic is a random variable, we cannot remove the probability of an error altogether. Instead, we will derive tests with some guaranteed performance to minimize the probability of type I error. To derive this, we can define the **power function** of a test,\n$$ \n\\pi(\\theta) = \\P\\left( \\text{Reject } H_0 \\mid \\theta \\right) = \\P\\left( T \\in R \\mid \\theta \\right),\n$$\nwhich is the probability of rejection as a function of the parameter of interest, $\\theta$. The power function tells us, for example, how likely we are to reject the null of no treatment effect as we vary the actual size of the treatment effect. \n\nWe can define the probability of type I error from the power function. \n\n::: {#def-size}\nThe **size** of a hypothesis test with the null hypothesis $H_0: \\theta = \\theta_0$ is \n$$ \n\\pi(\\theta_0) = \\P\\left( \\text{Reject } H_0 \\mid \\theta_0 \\right).\n$$\n:::\n\nYou can think of the size of a test as the rate of false positives (or false discoveries) produced by the test. @fig-size-power shows an example of rejection regions, size, and power for a one-sided test. In the left panel, we have the distribution of the test statistic under the null, with $H_0: \\theta = \\theta_0$, and the rejection region is defined by values $T > c$. The shaded grey region is the probability of rejection under this null hypothesis or the size of the test. 
Sometimes, we will get extreme samples by random chance, even under the null, leading to false discoveries.[^3]\n\n[^3]: Eagle-eyed readers will notice that the null tested here is a point, while we previously defined the null in a one-sided test as a region $H_0: \\theta \\leq \\theta_0$. Technically, the size of the test will vary based on which of these nulls we pick. In this example, notice that any null to the left of $\\theta_0$ will result in a lower size. And so, the null at the boundary, $\\theta_0$, will maximize the size of the test, making it the most \"conservative\" null to investigate. Technically, we should define the size of a test as $\\alpha = \\sup_{\\theta \\in \\Theta_0} \\pi(\\theta)$. \n\nIn the right panel, we overlay the distribution of the test statistic under one particular alternative, $\\theta = \\theta_1 > \\theta_0$. The red-shaded region is the probability of rejecting the null when this alternative is true, or the power---the probability of correctly rejecting the null when it is false. Intuitively, we can see that alternatives that produce test statistics closer to the rejection region will have higher power. This makes sense: detecting big deviations from the null should be easier than detecting minor ones. \n\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Size of a test and power against an alternative.](04_hypothesis_tests_files/figure-pdf/fig-size-power-1.pdf){#fig-size-power}\n:::\n:::\n\n\n\n\n@fig-size-power also hints at a tradeoff between size and power. Notice that we could make the size smaller (lower the false positive rate) by increasing the critical value to $c' > c$. This would make the probability of being in the rejection region smaller, $\\P(T > c' \\mid \\theta_0) < \\P(T > c \\mid \\theta_0)$, leading to a test with a smaller size. 
Unfortunately, it would also reduce power in the right panel since the probability of being in the rejection region will be lower under any alternative, $\\P(T > c' \\mid \\theta_1) < \\P(T > c \\mid \\theta_1)$. This means we usually cannot simultaneously reduce both types of errors. \n\n## Determining the rejection region\n\n\nIf we cannot simultaneously optimize a test's size and power, how should we determine where the rejection region is? That is, how should we decide what empirical evidence will be strong enough for us to reject the null? The standard approach to this problem in hypothesis testing is to control the size of a test (that is, control the rate of false positives) and try to maximize the power of the test subject to that constraint. So we say, \"I'm willing to accept at most x% of findings being false positives,\" and do whatever we can to maximize power subject to that constraint. \n\n::: {#def-level}\n\nA test has **significance level** $\\alpha$ if its size is less than or equal to $\\alpha$, or $\\pi(\\theta_0) \\leq \\alpha$.\n\n:::\n\nA test with a significance level of $\\alpha = 0.05$ will have a false positive/type I error rate no larger than 0.05. This level is widespread in the social sciences, though you also will see $\\alpha = 0.01$ or $\\alpha = 0.1$. Frequentists justify this by saying that with $\\alpha = 0.05$, at most 5% of studies will produce false discoveries. \n\nOur task is to construct the rejection region so that the **null distribution** of the test statistic $G_0(t) = \\P(T \\leq t \\mid \\theta_0)$ has less than $\\alpha$ probability in that region. One-sided tests like in @fig-size-power are the easiest to show, even though we warned you not to use them. 
We want to choose a critical value $c$ that puts no more than $\\alpha$ probability in the tail, or\n$$ \n\\P(T > c \\mid \\theta_0) = 1 - G_0(c) \\leq \\alpha.\n$$\nRemember that smaller values of $c$ yield higher power, which implies that the critical value that maximizes power while maintaining the significance level is the one where $1 - G_0(c) = \\alpha$. We can use the **quantile function** of the null distribution to find the exact value of $c$ we need,\n$$\nc = G^{-1}_0(1 - \\alpha),\n$$\nwhich is just fancy math to say, \"the value below which $1-\\alpha$ of the null distribution falls.\"\n\nThe determination of the rejection region follows the same principles for two-sided tests, but it is slightly more complicated because we reject when the magnitude of the test statistic is large, $|T| > c$. @fig-two-sided shows the basic setup. Notice that because there are two (disjoint) regions, we can write the size (false positive rate) as\n$$ \n\\pi(\\theta_0) = G_0(-c) + 1 - G_0(c).\n$$\nIn most cases that we will see, the null distribution for such a test will be symmetric around 0 (usually asymptotically standard normal, actually), which means that $G_0(-c) = 1 - G_0(c)$, so the size is\n$$ \n\\pi(\\theta_0) = 2(1 - G_0(c)).\n$$ \nSolving for the critical value that would make this $\\alpha$ gives\n$$ \nc = G^{-1}_0(1 - \\alpha/2).\n$$\nAgain, this formula can seem dense, but remember what you are doing: finding the value that puts $\\alpha/2$ of the probability of the null distribution in each tail. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Rejection regions for a two-sided test.](04_hypothesis_tests_files/figure-pdf/fig-two-sided-1.pdf){#fig-two-sided}\n:::\n:::\n\n\n\n\n## Hypothesis tests of the sample mean\n\nLet's go through an extended example of hypothesis testing for a sample mean, sometimes called a **one-sample test**. 
Let's say $X_i$ are feeling thermometer scores about \"liberals\" as a group on a scale of 0 to 100, with values closer to 0 indicating cooler feelings about liberals and values closer to 100 indicating warmer feelings about liberals. We want to know if the population average differs from a neutral value of 50. We can write this two-sided test as\n$$\nH_0: \\mu = 50 \\quad\\text{versus}\\quad H_1: \\mu \\neq 50,\n$$\nwhere $\\mu = \\E[X_i]$. The standard test statistic for this type of test is the so-called **t-statistic**, \n$$ \nT = \\frac{\\left( \\Xbar_n - \\mu_0 \\right)}{\\sqrt{s^2 / n}} =\\frac{\\left( \\Xbar_n - 50 \\right)}{\\sqrt{s^2 / n}},\n$$\nwhere $\\mu_0$ is the null value of interest and $s^2$ is the sample variance. If the null hypothesis is true, then by the CLT, we know that the t-statistic is asymptotically normal, $T \\indist \\N(0, 1)$. Thus, we can approximate the null distribution with the standard normal!\n\nLet's create a test with level $\\alpha = 0.05$. Then we need to find the rejection region that puts $0.05$ probability in the tails of the null distribution, which we just saw was $\\N(0,1)$. Let $\\Phi()$ be the CDF for the standard normal and let $\\Phi^{-1}()$ be the quantile function for the standard normal. Drawing on what we developed above, you can find the value $c$ so that $\\P(|T| > c \\mid \\mu_0)$ is 0.05 with\n$$\nc = \\Phi^{-1}(1 - 0.05/2) \\approx 1.96,\n$$\nwhich means that a test where we reject when $|T| > 1.96$ would have a level of 0.05 asymptotically. \n\n\n## The Wald test\n\nWe can generalize the hypothesis test for the sample mean to estimators more broadly. Let $\\widehat{\\theta}_n$ be an estimator for some parameter $\\theta$ and let $\\widehat{\\textsf{se}}[\\widehat{\\theta}_n]$ be a consistent estimate of the standard error of the estimator, $\\textsf{se}[\\widehat{\\theta}_n] = \\sqrt{\\V[\\widehat{\\theta}_n]}$. 
We consider the two-sided test\n$$\nH_0: \\theta = \\theta_0 \\quad\\text{versus}\\quad H_1: \\theta \\neq \\theta_0.\n$$\n\nIn many cases, our estimators will be asymptotically normal by a version of the CLT so that under the null hypothesis, we have\n$$ \nT = \\frac{\\widehat{\\theta}_n - \\theta_0}{\\widehat{\\textsf{se}}[\\widehat{\\theta}_n]} \\indist \\N(0, 1). \n$$\nThe **Wald test** rejects $H_0$ when $|T| > z_{\\alpha/2}$, with $z_{\\alpha/2}$ that puts $\\alpha/2$ in the upper tail of the standard normal. That is, if $Z \\sim \\N(0, 1)$, then $z_{\\alpha/2}$ satisfies $\\P(Z \\geq z_{\\alpha/2}) = \\alpha/2$. \n\n::: {.callout-note}\n\nIn R, you can find the $z_{\\alpha/2}$ values easily with the `qnorm()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.05 / 2, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.959964\n```\n:::\n:::\n\n\n\n:::\n\n::: {#thm-wald}\nAsymptotically, the Wald test has size $\\alpha$ such that\n$$ \n\\P(|T| > z_{\\alpha/2} \\mid \\theta_0) \\to \\alpha.\n$$\n\n:::\n\nThis result is very general, and it means that many, many hypothesis tests based on estimators will have the same form. The main difference across estimators will be how we calculate the estimated standard error. \n\n::: {#exm-two-props}\n\n## Difference in proportions\n\nIn get-out-the-vote (GOTV) experiments, we might randomly assign a group of citizens to receive mailers encouraging them to vote, whereas a control group receives no message. We'll define the turnout variables in the treatment group $Y_{1}, Y_{2}, \\ldots, Y_{n_t}$ as iid draws from a Bernoulli distribution with success $p_t$, which represents the population turnout rate among treated citizens. The outcomes in the control group $X_{1}, X_{2}, \\ldots, X_{n_c}$ are iid draws from another Bernoulli distribution with success $p_c$, which represents the population turnout rate among citizens not receiving a mailer. 
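The Wald recipe in @thm-wald is the same for any asymptotically normal estimator: form $T$ from the estimate and its standard error, then compare $|T|$ to $z_{\alpha/2}$. Here is a minimal sketch of that recipe, in Python rather than the chapter's R, using the standard library's `statistics.NormalDist` in place of `qnorm()`; the function name and the numbers plugged in are hypothetical, not from the chapter:

```python
from statistics import NormalDist

def wald_test(theta_hat, se_hat, theta0=0.0, alpha=0.05):
    """Two-sided Wald test: return the test statistic and the reject decision."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}; about 1.96 for alpha = 0.05
    t_stat = (theta_hat - theta0) / se_hat
    return t_stat, abs(t_stat) > z

# Hypothetical estimate: a difference in turnout proportions of 0.05
# with an estimated standard error of 0.02, testing H0: tau = 0.
t_stat, reject = wald_test(theta_hat=0.05, se_hat=0.02)
```

With these made-up numbers, $T = 2.5 > 1.96$, so the test rejects at the 0.05 level; shrinking the estimate to 0.03 gives $|T| = 1.5$ and the test retains the null.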
\n\n\nOur goal is to learn about the treatment effect of this treatment on whether or not the citizen votes, $\\tau = p_t - p_c$, and we will use the sample difference in means/proportions as our estimator, $\\widehat{\\tau} = \\Ybar - \\Xbar$. To perform a Wald test, we need to know/estimate the standard error of this estimator. Notice that because these are independent samples, the variance is\n$$ \n\\V[\\widehat{\\tau}_n] = \\V[\\Ybar - \\Xbar] = \\V[\\Ybar] + \\V[\\Xbar] = \\frac{p_t(1-p_t)}{n_t} + \\frac{p_c(1-p_c)}{n_c},\n$$\nwhere the third equality comes from the fact that the underlying outcome variables $Y_i$ and $X_j$ are binary. Obviously, we do not know the true population proportions $p_t$ and $p_c$ (that's why we're doing the test!), but we can estimate the standard error by replacing them with their estimates\n$$ \n\\widehat{\\textsf{se}}[\\widehat{\\tau}] = \\sqrt{\\frac{\\Ybar(1 -\\Ybar)}{n_t} + \\frac{\\Xbar(1-\\Xbar)}{n_c}}.\n$$\n\nThe typical null hypothesis test, in this case, is \"no treatment effect\" vs. \"some treatment effect\" or\n$$\nH_0: \\tau = p_t - p_c = 0 \\quad\\text{versus}\\quad H_1: \\tau \\neq 0,\n$$\nwhich gives the following test statistic for the Wald test\n$$\nT = \\frac{\\Ybar - \\Xbar}{\\sqrt{\\frac{\\Ybar(1 -\\Ybar)}{n_t} + \\frac{\\Xbar(1-\\Xbar)}{n_c}}}. \n$$\nIf we wanted a test with level $\\alpha = 0.01$, we would reject the null when $|T| > 2.58$ since\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.01/2, lower.tail = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2.575829\n```\n:::\n:::\n\n\n\n:::\n\n\n::: {#exm-diff-in-means}\n\n## Difference in means\n\nLet's take a similar setting to the last example with randomly assigned treatment and control groups, but now the treatment is an appeal for donations, and the outcomes are continuous measures of how much a person donated to the political campaign. 
Now the treatment data $Y_1, \\ldots, Y_{n_t}$ are iid draws from a population with mean $\\mu_t = \\E[Y_i]$ and population variance $\\sigma^2_t = \\V[Y_i]$. The control data $X_1, \\ldots, X_{n_c}$ are iid draws (independent of the $Y_i$) from a population with mean $\\mu_c = \\E[X_i]$ and population variance $\\sigma^2_c = \\V[X_i]$. The parameter of interest is similar to before: the population difference in means, $\\tau = \\mu_t - \\mu_c$, and we'll form the usual hypothesis test of\n$$ \nH_0: \\tau = \\mu_t - \\mu_c = 0 \\quad\\text{versus}\\quad H_1: \\tau \\neq 0.\n$$\n\nThe only difference between this setting and the difference in proportions is that the standard error will differ because we cannot rely on the Bernoulli variance formula. Instead, we'll use our knowledge of the sampling variance of the sample means and independence between the samples to derive \n$$\n\\V[\\widehat{\\tau}] = \\V[\\Ybar] + \\V[\\Xbar] = \\frac{\\sigma^2_t}{n_t} + \\frac{\\sigma^2_c}{n_c},\n$$\nwhere we estimate the unknown population variances with the sample variances\n$$\n\\widehat{\\se}[\\widehat{\\tau}] = \\sqrt{\\frac{s^2_t}{n_t} + \\frac{s^2_c}{n_c}}.\n$$\nWe can use this estimator to derive the Wald test statistic of \n$$ \nT = \\frac{\\widehat{\\tau} - 0}{\\widehat{\\se}[\\widehat{\\tau}]} = \\frac{\\Ybar - \\Xbar}{\\sqrt{\\frac{s^2_t}{n_t} + \\frac{s^2_c}{n_c}}},\n$$\nand if we want an asymptotic level of 0.05, we can reject when $|T| > 1.96$.\n:::\n\n\n## p-values\n\nThe hypothesis testing framework focuses on actually making a decision in the face of uncertainty. You choose a level of wrongness you are comfortable with (rate of false positives) and then decide null vs. alternative based firmly on the rejection region. When we do not actually need to make a decision, though, this framework somewhat artificially discards information about the strength of the evidence. 
We \"accept\" the null if $T = 1.95$ in the last example but reject it if $T = 1.97$, even though these two situations are actually very similar. Just reporting the reject/retain decision also fails to give us a sense of at what other levels we might have rejected the null. Again, this makes sense if we need to make a single decision: other tests don't matter because we carefully considered our $\\alpha$ level test. But in the lower-stakes world of the academic social sciences, we can afford to be more informative. \n\nOne alternative to reporting the reject/retain decision is to report a **p-value**. \n\n::: {#def-p-value}\n\nThe **p-value** of a test is the probability of observing a test statistic at least as extreme as the observed test statistic in the direction of the alternative hypothesis. \n\n:::\n\nThe line \"in the direction of the alternative hypothesis\" deals with the unfortunate headache of one-sided versus two-sided tests. For a one-sided test where larger values of $T$ correspond to more evidence for $H_1$, the p-value is\n$$\n\\P(T(X_1,\\ldots,X_n) > T \\mid \\theta_0) = 1 - G_0(T),\n$$\nwhereas for a (symmetric) two-sided test, we have\n$$ \n\\P(|T(X_1, \\ldots, X_n)| > |T| \\mid \\theta_0) = 2(1 - G_0(|T|)).\n$$\n\nIn either case, the interpretation of the p-value is the same. It is the smallest size $\\alpha$ at which a test would reject the null. Presenting a p-value allows readers to choose their own $\\alpha$ level and quickly determine whether the evidence would warrant rejecting $H_0$ in that case. Thus, the p-value is a more **continuous** measure of evidence against the null, where lower values are stronger evidence against the null because the observed result is less likely under the null. \n\nThere is a lot of controversy surrounding p-values, but most of it focuses on arbitrary p-value cutoffs for determining statistical significance and sometimes publication decisions. 
These problems are not the fault of p-values but rather the hyperfixation on the reject/retain decision for arbitrary test levels like $\\alpha = 0.05$. It might be best to view p-values as a transformation of the test statistic onto a common scale between 0 and 1. \n\n::: {.callout-warning}\n\nPeople use many statistical shibboleths to purportedly identify those who don't understand statistics, and these shibboleths usually hinge on seemingly subtle differences in interpretation that are easy to miss. If you know the core concepts, the statistical shibboleths tend to be overblown, but it would be malpractice not to flag them for you. \n\nThe shibboleth with p-values is that sometimes people interpret them as \"the probability that the null hypothesis is true.\" Of course, this doesn't make sense from our definition because the p-value *conditions* on the null hypothesis---it cannot tell us anything about the probability of that null hypothesis. Instead, the metaphor you should always carry is that hypothesis tests are statistical thought experiments and that p-values answer the question: how likely would my data be if the null were true? \n\n:::\n\n\n## Power analysis\n\nImagine you have spent a large research budget on a big experiment to test your amazing theory, and the results come back and... you fail to reject the null of no treatment effect. When this happens, there are two possible states of the world: the null is true, and you correctly identified that, or the null is false, but the test had insufficient power to detect the true effect. Because of this uncertainty after the fact, it is common for researchers to conduct **power analyses** before running a study to forecast what sample size is necessary to ensure they can reject the null under a hypothesized effect size. \n\nGenerally, power analyses involve calculating the power function $\\pi(\\theta) = \\P(T(X_1, \\ldots, X_n) \\in R \\mid \\theta)$ for different values of $\\theta$. 
It might also involve sample size calculations for a particular alternative, $\\theta_1$. In that case, we try to find the sample size $n$ that makes the power $\\pi(\\theta_1)$ as close to a particular value (often 0.8) as possible. It is possible to solve for this sample size explicitly in simple one-sided tests, but for more general situations or two-sided tests, we typically need numerical or simulation-based approaches to find the optimal sample size. \n\nWith Wald tests, we can characterize the power function quite easily, even if it does not allow us to back out sample size calculations directly. \n\n::: {#thm-power}\nFor a Wald test with an asymptotically normal estimator, the power function for a particular alternative $\\theta_1 \\neq \\theta_0$ is \n$$ \n\\pi(\\theta_1) = 1 - \\Phi\\left( \\frac{\\theta_0 - \\theta_1}{\\widehat{\\se}[\\widehat{\\theta}_n]} + z_{\\alpha/2} \\right) + \\Phi\\left( \\frac{\\theta_0 - \\theta_1}{\\widehat{\\se}[\\widehat{\\theta}_n]}-z_{\\alpha/2} \\right).\n$$\n\n:::\n\n\n## Exact tests under normal data\n\nThe Wald test above relies on large sample approximations. In finite samples, these approximations may not be valid. Can we get **exact** inferences at any sample size? Yes, if we make stronger assumptions about the data. In particular, assume a **parametric model** for the data where $X_1,\\ldots,X_n$ are iid samples from $N(\\mu,\\sigma^2)$. Under a null of $H_0: \\mu = \\mu_0$, we can show that \n$$ \nT_n = \\frac{\\Xbar_n - \\mu_0}{s_n/\\sqrt{n}} \\sim t_{n-1},\n$$\nwhere $t_{n-1}$ is the **Student's t-distribution** with $n-1$ degrees of freedom. This result implies the null distribution is $t$, so we use quantiles of the $t$ distribution for critical values. For a one-sided test, $c = G^{-1}_0(1 - \\alpha)$, but now $G_0$ is the $t$ distribution with $n-1$ degrees of freedom, so we use `qt()` instead of `qnorm()` to calculate these critical values. 
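To see why the $t$ critical values matter, here is a hypothetical Monte Carlo check (in Python rather than the chapter's R, using only the standard library; the sample size, seed, and simulation count are mine): with $n = 5$ draws from a normal population, rejecting whenever the t-statistic exceeds the *normal* cutoff of 1.96 produces far more than 5% false positives under the null, which is exactly the gap the exact test closes by using the larger $t_{4}$ cutoff, `qt(0.975, df = 4)` $\approx 2.78$.

```python
import random
from statistics import mean, stdev

random.seed(42)

def t_stat(sample, mu0=0.0):
    """One-sample t-statistic for H0: mu = mu0."""
    return (mean(sample) - mu0) / (stdev(sample) / len(sample) ** 0.5)

# Simulate the null distribution of the t-statistic with a small sample (n = 5)
n, sims = 5, 20_000
draws = [t_stat([random.gauss(0, 1) for _ in range(n)]) for _ in range(sims)]

# Using the normal cutoff 1.96 over-rejects badly at n = 5: the empirical size
# comes out near 0.12 rather than 0.05, because t_4 has much fatter tails.
size_with_normal_cutoff = sum(abs(t) > 1.96 for t in draws) / sims
```

The simulated 97.5th percentile of these t-statistics lands near 2.78, matching the $t_4$ critical value rather than the normal one.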
\n\nThe critical values for the $t$ distribution are always larger than the normal because the t has fatter tails, as shown in @fig-shape-of-t. As $n\\to\\infty$, however, the $t$ converges to the standard normal, and so it is asymptotically equivalent to the Wald test but slightly more conservative in finite samples. Oddly, most software packages calculate p-values and rejection regions based on the $t$ to exploit this conservativeness. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Normal versus t distribution.](04_hypothesis_tests_files/figure-pdf/fig-shape-of-t-1.pdf){#fig-shape-of-t}\n:::\n:::\n\n\n\n\n## Confidence intervals and hypothesis tests\n\n\nAt first glance, we may seem sloppy in using $\\alpha$ in deriving a $1 - \\alpha$ confidence interval in the last chapter and an $\\alpha$-level test in this chapter. In reality, we were foreshadowing the deep connection between the two: every $1-\\alpha$ confidence interval contains all null hypotheses that we **would not reject** with an $\\alpha$-level test. \n\nThis connection is easiest to see with an asymptotically normal estimator, $\\widehat{\\theta}_n$. Consider the hypothesis test of \n$$ \nH_0: \\theta = \\theta_0 \\quad \\text{vs.}\\quad H_1: \\theta \\neq \\theta_0,\n$$\nusing the test statistic,\n$$ \nT = \\frac{\\widehat{\\theta}_{n} - \\theta_{0}}{\\widehat{\\se}[\\widehat{\\theta}_{n}]}. \n$$\nAs we discussed earlier, an $\\alpha = 0.05$ test would reject this null when $|T| > 1.96$, or when \n$$ \n|\\widehat{\\theta}_{n} - \\theta_{0}| > 1.96 \\widehat{\\se}[\\widehat{\\theta}_{n}]. 
\n$$\nNotice that this will be true when \n$$ \n\\theta_{0} < \\widehat{\\theta}_{n} - 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}]\\quad \\text{ or }\\quad \\widehat{\\theta}_{n} + 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}] < \\theta_{0}\n$$\nor, equivalently, when the null hypothesis is outside of the 95% confidence interval, $$\\theta_0 \\notin \\left[\\widehat{\\theta}_{n} - 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}], \\widehat{\\theta}_{n} + 1.96\\widehat{\\se}[\\widehat{\\theta}_{n}]\\right].$$ \nOf course, our choice of the null hypothesis was arbitrary, which means that any null hypothesis outside the 95% confidence interval would be rejected by an $\\alpha = 0.05$ level test of that null. And any null hypothesis inside the confidence interval is a null hypothesis that we would not reject. \n\nThis relationship holds more broadly. Any $1-\\alpha$ confidence interval contains all possible parameter values that would not be rejected as the null hypothesis of an $\\alpha$-level hypothesis test. This connection can be handy for two reasons:\n\n1. We can quickly determine if we would reject a null hypothesis at some level by inspecting whether it falls in a confidence interval. \n2. In some situations, determining a confidence interval might be difficult, but performing a hypothesis test is straightforward. Then, we can find the rejection region for the test and determine what null hypotheses would not be rejected at level $\\alpha$ to formulate the $1-\\alpha$ confidence interval. We call this process **inverting a test**. A critical application of this method is formulating confidence intervals for treatment effects based on randomization inference in the finite-population analysis of experiments. 
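The test-inversion logic can be sketched numerically (again in Python rather than the chapter's R; the estimate, standard error, and grid of candidate nulls are made-up numbers): scanning candidate null values, the ones a 0.05-level Wald test fails to reject are exactly the ones inside the 95% confidence interval.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # about 1.96

def reject(theta0, theta_hat, se_hat):
    """Two-sided alpha = 0.05 Wald test of H0: theta = theta0."""
    return abs((theta_hat - theta0) / se_hat) > z

theta_hat, se_hat = 0.1, 0.04  # hypothetical estimate and standard error
ci = (theta_hat - z * se_hat, theta_hat + z * se_hat)

# Invert the test: scan candidate nulls and keep the ones we fail to reject.
grid = [i / 1000 for i in range(-100, 301)]
retained = [t0 for t0 in grid if not reject(t0, theta_hat, se_hat)]
```

Every retained null lies inside `ci`, and the smallest and largest retained values sit right at its endpoints, which is the duality the text describes.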
\n\n\n", "supporting": [ "04_hypothesis_tests_files/figure-pdf" ], diff --git a/_freeze/04_hypothesis_tests/figure-pdf/fig-shape-of-t-1.pdf b/_freeze/04_hypothesis_tests/figure-pdf/fig-shape-of-t-1.pdf index 2beb0e083fd409e8c6821182d3152b4f4f917950..31bf72b111b8b6d93b9d4912f1714e49d7254e62 100644 GIT binary patch delta 169 zcmca_aNl6VG7cppGa~~-3u6mSE`8tp6qm%3R0RzeD+>D%!Or0#9*TOiauzjUA1gjO-L_2r7wHu(RVT ZE=epZsVGWK<1)1{GUifMb@g}S0suGfDUSdE delta 169 zcmca_aNl6VG7cpJGec7Y6JrZaE`8tp6qm%3R0RzeDG98RJY{EG^B=oE$Bk99@kqoQ=$!+>DJ)j4WL&9gW@0&5Z06YzQieRj{+; aDlSPZDyb++P2)1PFf!&+Rdw}u;{pIo;3?+- diff --git a/_freeze/04_hypothesis_tests/figure-pdf/fig-size-power-1.pdf b/_freeze/04_hypothesis_tests/figure-pdf/fig-size-power-1.pdf index b39319d20bd2567428430c1610e6df5408ccc952..f5a5510089a9230406784a72289908f9a550155b 100644 GIT binary patch delta 166 zcmZ4EvBqP=IUyw@Ga~~-3u6mSE`8tp6qm%3R0RzeDIMetc$CjnG-YyjGqyBzaW!%_Hg_>}votYsb~HC~ws0~tH!yNGH8OUyQ?MbVWU`=& FGynijD5?Me delta 166 zcmZ4EvBqP=IUywjGec7Y6JrZaE`8tp6qm%3R0RzeDIMetc$CjnG-Yyjbuut>GdDChFgI{DcXTl{bTKh9votevbThRuF>y7wQ?MbVWU`=& FGyw6cC~g1% diff --git a/_freeze/04_hypothesis_tests/figure-pdf/fig-two-sided-1.pdf b/_freeze/04_hypothesis_tests/figure-pdf/fig-two-sided-1.pdf index cf6c1997f456471c3c1cb6a216039b2d5dc30b9f..31ced1b90ad67a01b7c274fe64d57fe73b37011b 100644 GIT binary patch delta 166 zcmZ2szrub)55JO;nUR5^g|USum%eX)ic4Zis)B}#m63swv7rG(Zu2_+K3+|jI$Z-3 zbpr!+JjzpLpE5d|TDqAVSh|`R8#=ogySTcV8<;tnTRIuKIh&iAIvP0JDcBHFGPze? F8UWEFDK`KB delta 166 zcmZ2szrub)55JOunW3qHiLr$ym%eX)ic4Zis)B}#m63swv7rG(Zu2_+K3+|jI$Z-3 zbpr!+JjzpLpE5c-x|z8+nK>F;8o8PoyP3IKxLOz*ngLC8voJGvb9S>+upy*ma<9BJ E0MJA!I{*Lx