# Design-based Inference {#sec-design-based}
## Introduction
Quantitative analysis of social data has an alluring exactness to it. It allows us to estimate the average number of minutes of YouTube videos watched to the millisecond, and in doing so it gives us the aura of true scientists. But the advantage of quantitative analyses lies not in the ability to derive precise three-decimal point estimates; rather, quantitative methods shine because they allow us to communicate methodological goals, assumptions, and results in a (hopefully) common, compact, and precise mathematical language. It is this language that helps clarify *exactly* what researchers are doing with their data and why.
This dewy view of quantitative methods is unfortunately often at odds with how these methods are used in the real world. All too often we as researchers find some arbitrary data, apply a statistical tool with which we are familiar, and then shoehorn the results into a theoretical story that may or may not have a (tenuous) connection. Quantitative methods applied this way will provide us with a very specific answer to a murky question about a shapeless target.
This book is a guide to a better foundation for quantitative analysis and, in particular, for statistical inference. Inference is the task of using the data we have to learn something about the data we do not have.
The organizing motto of this book is to help us as researchers be
> Precise in stating our goals, transparent in stating our assumptions, and honest in evaluating our results.
These goals are the target of our inference: what we want to learn and about whom.
In pursuing these goals, this book will focus on a general workflow for statistical inference. The workflow boils down to answering a series of questions about the goals, assumptions, and methods of our analysis:
1. **Population**: who or what do we want to learn about?
2. **Design/model**: how will we collect the data, or, what assumptions are we making about how the data came to be?
3. **Quantity of Interest**: what do we want to learn about the population?
4. **Estimator**: how will we use the data to produce an estimate?
5. **Uncertainty**: how will we estimate and convey the error associated with the estimate?
These questions form the core of any quantitative research endeavor. And the answers to these will draw on a mixture of substantive interests, feasibility, and statistical theory, and this mixture will vary from question to question. For example, the population of interest can vary greatly from study to study, whereas many disparate studies may employ the same estimand and estimator.
The second core question is particularly important, since it highlights an essential division in how researchers approach statistical inference – specifically, **design-based inference** vs **model-based inference**. Design-based inference typically focuses on situations in which we have precise knowledge of how our sample was randomly selected from the population. Uncertainty here comes exclusively from the random nature of which observations are included in the sample. By contrast, in the **model-based** framework, we treat our data as random variables and propose a probabilistic model for how the data came to exist. The models then vary in the strength of their assumptions.
Design-based inference is the framework that addresses the core inferential questions most crisply, and so it is the focus of this chapter. Its main disadvantages are that it is considerably less general than the model-based approach and that the mathematics of the framework are slightly more complicated.
We will now go over each of the core questions in more detail.
## Question 1: Population
Inference is the task of using the data that we have to learn facts about the world (i.e., the data we do not have). The most straightforward setting is when we have a fixed set of units that we want to learn something about. These units are what we call the **population** or **target population**. We are going to focus on random sampling from this population, but, to do so, we need to have a list of units from the population. This list of $N$ units is called the **frame** or **sampling frame**, and we will index these units in the sampling frame by $i \in \mathcal{U} = \{1, \ldots, N\}$. Here we assume that $N$, the size of the population, is known, but note that this may not always be true.
The sampling frame may differ from the target population simply for feasibility reasons. For example, the target population might include all the households in a given city, but the sampling frame might be the list of all residential telephone numbers for that city. Of course, many households do not have landline telephones and rely on mobile phones exclusively. This gap between the target population and the sampling frame is called **undercoverage** or **coverage bias**.
::: {#exm-frame-bias}
An early but prominent example of frame bias in survey sampling is the infamous *Literary Digest* poll of the 1936 U.S. presidential election. *Literary Digest*, a (now defunct) magazine, sent over 10 million ballots to addresses found in automobile registration lists and telephone books, trying to figure out who would win the important 1936 presidential race. The sample size was huge: over 2 million respondents. In the end, the results predicted that Alf Landon, the Republican candidate, would receive 55% of the vote, while the incumbent, Democratic President Franklin D. Roosevelt, would only win 41% of the vote. Unfortunately for the *Literary Digest*, Landon only received 37% of the vote.
There are many possible reasons for this massive polling error. Most obviously, the sampling frame differed from the target population. Why? Only those with either a car or a telephone were included in the sampling frame, and people without either overwhelmingly supported the Democrat, Roosevelt. While this is not the only source of bias – differential nonresponse also seems to have been a particularly big problem – the frame bias contributed a large part of the error. For more about this poll, see @Squire88.
:::
One advantage of design-based inference is how precisely we must articulate the sampling frame. We can be extremely clear about the group of units we are trying to learn about. We shall see that in model-based inference the concepts of the population and sampling frame become more amorphous.
::: {#exm-anes-population}
## American National Election Studies, Population
According to the materials from the American National Election Studies (ANES) in 2012, its target population is all U.S. citizens age 18 or older. The sampling frame for the face-to-face portion of the survey "consisted of the Delivery Sequence File (DSF) used by the United States Postal Service" for residential delivery of mail. Unfortunately, there are housing units that are not covered by postal mail delivery, which creates the potential for frame bias. The designers of the ANES used the Decennial Census to add many of these units to the final sampling frame.
:::
## Question 2: Sampling design
Now that we have a clearly defined population and sampling frame, we can consider how to select a sample from the population. We will focus on **probabilistic samples**, where units are selected into the sample by chance, and each unit in the sampling frame has a nonzero probability of being included. Let $\mathcal{S} \subset \mathcal{U}$ be a sample and let $\mb{Z} = (Z_1, Z_2, \ldots, Z_N)$ be a vector of inclusion indicators such that $Z_i = 1$ if $i \in \mathcal{S}$ and $Z_i = 0$ otherwise. We denote these indicators as upper-case letters because they are random variables. We assume the sample size is $|\mathcal{S}| = n$.
Suppose our sampling frame was the hobbits who are members of the Fellowship of the Ring, an exclusive group brought into being by a wizened elf lord. This group of four hobbits is a valid, albeit small and fictional, population with $\mathcal{U} =$ \{Frodo, Sam, Pip, Merry\}.
Suppose we want to sample two hobbits from this group. We can list all six possible samples of size two from this population in terms of the sample members $\mathcal{S}$ or, equivalently, the inclusion indicators $\mb{Z}$:
- $\mathcal{S}_1 =$ \{Frodo, Sam\} with $\mb{Z}_{1} = (1, 1, 0, 0)$
- $\mathcal{S}_2 =$ \{Frodo, Pip\} with $\mb{Z}_{2} = (1, 0, 1, 0)$
- $\mathcal{S}_3 =$ \{Frodo, Merry\} with $\mb{Z}_{3} = (1, 0, 0, 1)$
- $\mathcal{S}_4 =$ \{Sam, Pip\} with $\mb{Z}_{4} = (0, 1, 1, 0)$
- $\mathcal{S}_5 =$ \{Sam, Merry\} with $\mb{Z}_{5} = (0, 1, 0, 1)$
- $\mathcal{S}_6 =$ \{Pip, Merry\} with $\mb{Z}_{6} = (0, 0, 1, 1)$
A **sampling design** is a complete specification of how likely each of these samples is to be selected. That is, we need to determine a selection probability $\pi_j$ for each sample $\mathcal{S}_j$. The most widely used and widely studied design is one that places equal probability on each of the possible samples of size $n$.
::: {#def-srs}
A **simple random sample** (srs) is a probability sampling design where each possible sample of size $n$ has the same probability of occurring. More specifically, let $\mb{z} = (z_{1}, \ldots, z_{N})$ be a particular possible sample. Then,
$$
\P(\mb{Z} = \mb{z}) = \begin{cases}
{N \choose n}^{-1} &\text{if } \sum_{i=1}^N z_i = n,\\
0 & \text{otherwise}
\end{cases}
$$
:::
If we sampled two hobbits, the srs would place probability $1/{4\choose 2} = 1/6$ on each of the above samples $\mathcal{S}_j$. Note that the srs gives zero probability to any sample that does not have exactly $n$ units.
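To make the enumeration concrete, here is a minimal Python sketch that lists every possible sample of size two from the four-hobbit frame, together with its inclusion indicators. The variable names (`frame`, `samples`, `n_samples`) are our own, chosen for illustration.

```python
from itertools import combinations
from math import comb

# Sampling frame from the text: four hobbits, indexed in this order.
frame = ["Frodo", "Sam", "Pip", "Merry"]
n = 2  # sample size

# Every subset of size n is a possible sample under an srs.
samples = list(combinations(frame, n))

# Under an srs, each sample has probability 1 / (N choose n).
n_samples = comb(len(frame), n)

for j, sample in enumerate(samples, start=1):
    # Inclusion indicators: 1 if the unit is in the sample, 0 otherwise.
    z = tuple(int(unit in sample) for unit in frame)
    print(f"S_{j} = {sample}, Z_{j} = {z}, probability = 1/{n_samples}")
```

The loop visits the six samples in the same order as the list above, since `combinations` preserves the order of the frame.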
Another common sampling design – the **Bernoulli sampling** design – works by choosing each unit independently with the same probability.
::: {#def-srs}
**Bernoulli sampling** is a probability sampling design where independent Bernoulli trials with probability of success $q$ determine whether each unit in the population will be included in the sample. More specifically, let $\mb{z} = (z_{1}, \ldots, z_{N})$ be a particular possible sample. Under Bernoulli sampling, the probability of this sample is
$$
\P(\mb{Z} = \mb{z}) = \P(Z_1 = z_1) \cdots \P(Z_N = z_N) = \prod_{i=1}^N q^{z_i}(1 - q)^{1-z_i}
$$
:::
Bernoulli sampling is very straightforward because independently selecting units simplifies many calculations. However, this "coin flipping" approach means that the sample size, $N_s = \sum_{i=1}^N Z_i$, will itself be a random variable because it is determined by how many of the coin flips land on "heads."
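The randomness of the sample size can be seen by enumeration. The sketch below assumes an inclusion probability of $q = 1/2$ for the four hobbits (a value we picked purely for illustration) and adds up the probability of every indicator vector by its realized sample size. The helper `bernoulli_prob` is our own name.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

N = 4               # population size (the four hobbits)
q = Fraction(1, 2)  # assumed inclusion probability, for illustration only

def bernoulli_prob(z, q):
    """Probability of indicator vector z: product of q^{z_i} (1 - q)^{1 - z_i}."""
    prob = Fraction(1)
    for z_i in z:
        prob *= q if z_i == 1 else 1 - q
    return prob

# Accumulate the probability of each realized sample size, sum_i z_i.
size_dist = defaultdict(Fraction)
for z in product([0, 1], repeat=N):
    size_dist[sum(z)] += bernoulli_prob(z, q)

for size in sorted(size_dist):
    print(f"P(sample size = {size}) = {size_dist[size]}")
```

Under this assumed $q$, a sample of exactly two hobbits occurs with probability 3/8, and the empty sample with probability 1/16, so the analyst cannot fix $n$ in advance.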
Simple random samples and Bernoulli random samples are simple to understand and implement. For large surveys, the sampling designs are often much more complicated for cost-saving reasons. We now describe the sampling design for the ANES, which contains many design features typical of similar large surveys.
::: {#exm-anes-design}
## American National Election Studies, Sampling Design
The ANES uses a typical yet complicated design for its 2012 face-to-face survey. First, the designers divided (or stratified) U.S. states into nine Census divisions (which are based on geography). Within each division, designers then randomly sampled a number of census tracts (with a higher number of sampled tracts for divisions with higher populations). Census tracts with larger populations were selected with higher probability.
The second stage randomly sampled addresses from the sampling frame (described in @exm-anes-population). More households were sampled from tracts with a higher proportion of Black and Latino residents to obtain an oversample of these groups.
Finally, the third stage of sampling was to randomly select one eligible person per household for completion of the survey.
:::
## Question 3: Quantity of Interest
The **quantity of interest** is a numerical summary of the population that we want to learn about. These quantities are also called **estimands** (Latin for "the thing to be estimated").
Let $x_1, x_2, \ldots, x_N$ be a fixed set of characteristics, or items, about the population. Using the statistician's favorite home decor, we might think about our population as a set of marbles in a jar where the $x_i$ values indicate, for example, the color of the $i$-th marble. In a survey, $x_i$ might represent the age, ideology, or income of the $i$-th person in the population.
We can define many useful quantities of interest based on the population characteristics. These quantities generally summarize the values $x_1, \ldots, x_N$. One of the most common, and certainly one of the most useful, is the **population mean**, defined as
$$
\overline{x} = \frac{1}{N} \sum_{i=1}^N x_i.
$$
The population mean is fixed because $N$ and the population characteristics $x_1, \ldots, x_N$ are fixed. Another common estimand in the survey sampling literature is the population total,
$$
t = \sum_{i=1}^N x_i = N\overline{x}.
$$
::: {#exm-subpopulation}
## Subpopulation means
We may also be interested in quantities for different subdomains. Suppose we are interested in estimating the fraction of (say) conservative-identifying respondents who support increasing legal immigration. Let $d = 1, \ldots, D$ index the subdomains or subpopulations. In this case, we might have $d = 1$ as liberal identifiers, $d = 2$ as moderate identifiers, and $d = 3$ as conservative identifiers. We will refer to the subpopulation for each of these groups as $\mathcal{U}_d \subset \{1,\ldots, N\}$ and we define the size of these groups as $N_d = |\mathcal{U}_d|$. So, $N_3$ would be the number of conservative-identifying citizens in the population.
The mean for each group is then
$$
\overline{x}_d = \frac{1}{N_d} \sum_{i \in \mathcal{U}_d} x_i.
$$
Subpopulation estimation can be slightly more complicated than population estimation because we may not know who is in which subpopulation until we actually sample the population. For example, our sampling frame probably does not contain information about potential respondents' ideology. Thus, $N_d$ will be unknown to the researcher, unlike $N$ for the population mean, which is known.
:::
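As a small illustration of the subpopulation mean, the sketch below computes $\overline{x}_d$ for each group in a made-up population of six people. The data are entirely hypothetical and serve only to show the bookkeeping: group sizes $N_d$ and group totals are tallied, then divided.

```python
from collections import defaultdict

# Hypothetical population of six people: (ideology group, x_i), where
# x_i = 1 if the person supports increasing legal immigration, else 0.
population = [
    ("liberal", 1), ("liberal", 1),
    ("moderate", 0), ("moderate", 1),
    ("conservative", 0), ("conservative", 1),
]

totals = defaultdict(int)  # sum of x_i within each subpopulation
sizes = defaultdict(int)   # N_d, the size of each subpopulation

for group, x in population:
    totals[group] += x
    sizes[group] += 1

# Subpopulation mean: total within the group divided by the group size.
subpop_means = {group: totals[group] / sizes[group] for group in sizes}
print(subpop_means)
```

Here the full population is observed, so every $N_d$ is known. With a sample, the group sizes would themselves have to be estimated.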
We may be interested in many other quantities of interest, but design-based inference is largely focused on these types of population and subpopulation means and totals.
## Question 4: Estimator
Now that we have a sampling design and a quantity of interest, we can consider what we can learn about this quantity of interest from our sample. An **estimator** is a function of the sample measurements intended as a best guess about our quantity of interest.
If the most common estimand is the population mean, the most popular estimator is the **sample mean**, defined as
$$
\overline{X}_n = \frac{1}{n} \sum_{i=1}^{N}Z_ix_i
$$
The sample mean is a **random** quantity since it varies from sample to sample, and those samples are chosen probabilistically. For example, suppose we have height measurements from our small population of hobbits in @tbl-hobbit-pop.
| Unit ($i$) | Height in cm ($x_i$) |
|:-----------|:---------------------|
| 1 (Frodo) | 124 |
| 2 (Sam) | 127 |
| 3 (Pip) | 123 |
| 4 (Merry) | 127 |
: A small population of hobbits {#tbl-hobbit-pop}
If we consider a simple random sample of size $n=2$ from this population, we can list the probability of all possible sample means associated with this sampling design as we do in @tbl-hobbit-samples. @tbl-sampling-dist combines the equivalent values of the sample mean to arrive at the **sampling distribution** of the sample mean of hobbit height under an srs of size 2.
| Sample ($j$) | Probability ($\pi_j$) | Sample mean ($\overline{X}_n$) |
|:-----------------|:----------------------|:-------------------------------|
| 1 (Frodo, Sam) | 1/6 | (124 + 127) / 2 = 125.5 |
| 2 (Frodo, Pip) | 1/6 | (124 + 123) / 2 = 123.5 |
| 3 (Frodo, Merry) | 1/6 | (124 + 127) / 2 = 125.5 |
| 4 (Sam, Pip) | 1/6 | (127 + 123) / 2 = 125 |
| 5 (Sam, Merry) | 1/6 | (127 + 127) / 2 = 127 |
| 6 (Pip, Merry) | 1/6 | (123 + 127) / 2 = 125 |
: All possible simple random samples of size 2 from the hobbit population {#tbl-hobbit-samples}
| Sample mean | Probability |
|-------------|-------------|
| 123.5 | 1/6 |
| 125 | 1/3 |
| 125.5 | 1/3 |
| 127 | 1/6 |
: Sampling distribution of the sample mean for simple random samples of size 2 from the hobbit population {#tbl-sampling-dist}
Thus, the sampling distribution tells us what values of an estimator are more or less likely and depends on both the population distribution and the sampling design.
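The two tables can be reproduced mechanically. The following sketch enumerates all simple random samples of size 2 from the hobbit heights and collapses equal sample means into a sampling distribution, using exact fractions to avoid rounding. The names `heights` and `sampling_dist` are ours.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Heights in cm from the hobbit population table.
heights = {"Frodo": 124, "Sam": 127, "Pip": 123, "Merry": 127}
n = 2

samples = list(combinations(heights, n))
prob = Fraction(1, len(samples))  # each srs sample is equally likely

# Add up the probability of every sample that yields the same mean.
sampling_dist = Counter()
for sample in samples:
    mean = Fraction(sum(heights[unit] for unit in sample), n)
    sampling_dist[mean] += prob

for mean in sorted(sampling_dist):
    print(f"{float(mean):6.1f}  {sampling_dist[mean]}")
```

The printed rows match the sampling distribution table: 123.5 and 127 each with probability 1/6, and 125 and 125.5 each with probability 1/3.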
::: {.callout-note}
Notice that the sampling distribution of an estimator will depend on the sampling design. Here, we used a simple random sample. Bernoulli sampling would have produced a different distribution. Using Bernoulli sampling, we could end up with a sample of just Frodo, in which case the sample mean would be his height (124 cm), a sample mean value that is impossible with simple random sampling of size $n=2$.
:::
### Properties of the sampling distribution of an estimator
Generally speaking, we want "good" estimators. But what makes an estimator "good"? The best estimator would obviously be the one that is right all of the time ($\Xbar_n = \overline{x}$ with probability 1), but this is only possible if we conduct a census – that is, sample everyone in the population – or the population does not vary. Neither situation is typical for most researchers.
We instead focus on properties of the sampling distribution of an estimator. The following types of questions get at these properties:
- Are the estimator's observed values (realizations) centered on the true value of the quantity of interest? (unbiasedness)
- Is there a lot or a little variation in the realizations of the estimator across different samples from the population? (sampling variance)
- On average, how close to the truth is the estimator? (mean squared error)
The answers to these questions will depend on (a) the estimator and (b) the sampling design.
To back up, the sampling distribution shows us all the possible values of an estimator across different samples from the population. If we want to summarize this distribution with a single number, we would focus on its expectation, which is a measure of central tendency of the distribution. Roughly speaking, we want the center of the distribution to be close to and ideally equal to the true quantity of interest. If this is not the case, that means the estimator systematically over- or under-estimates the truth. We call this difference the **bias** of an estimator, which can be written mathematically as
$$
\textsf{bias}[\Xbar_{n}] = \E[\Xbar_{n}] - \overline{x}.
$$
Any estimator that has bias equal to zero is called an **unbiased** estimator.
We can calculate the bias of our hobbit srs (where we sampled two hobbits from the Fellowship of the Ring with equal probability) by first calculating the expected value of the estimator,
$$
\E[\Xbar_{n}] = \frac{1}{6}\cdot 123.5 + \frac{1}{3} \cdot 125 + \frac{1}{3} \cdot 125.5 + \frac{1}{6} \cdot 127 = 125.25,
$$
and comparing this to the population mean,
$$
\overline{x} = \frac{1}{4}\left(124 + 127 + 123 + 127\right) = 125.25.
$$
The two are the same, meaning the sample mean in this simple random sample is unbiased.
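The same check can be done by brute force. This sketch averages the sample mean over all six equally likely samples and subtracts the population mean, with exact fractions so that a zero bias shows up as exactly zero.

```python
from fractions import Fraction
from itertools import combinations

heights = [124, 127, 123, 127]  # Frodo, Sam, Pip, Merry (cm)
N, n = len(heights), 2

# Sample mean for each of the six equally likely srs samples.
sample_means = [Fraction(sum(s), n) for s in combinations(heights, n)]

# Expectation under the srs design: a simple average over the samples.
expected_value = sum(sample_means) / len(sample_means)
population_mean = Fraction(sum(heights), N)
bias = expected_value - population_mean

print(float(expected_value), float(population_mean), bias)
```

Both the expectation and the population mean come out to 125.25, so the bias is 0.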
::: {.callout-warning}
Note that the word "bias" sometimes also refers to research that is systematically incorrect in other ways. For example, we might complain that a survey question is biased if it presents a leading or misleading question or if it mismeasures the concept of interest. To see this, suppose we wanted to estimate the proportion of a population that regularly donates money to a political campaign, but $x_i$ actually measures whether a person donated on the day of the survey. In this case, $\overline{x}$ would be quite a bit lower than the quantity of interest because it only captures one day of donation patterns, not regular donations made over time. Textbooks often refer to this gap between the measures we obtain and the measures we want as **measurement bias**. This is distinct from the bias of the sample mean. In our donations example, under an srs from the population, $\Xbar_{n}$ would still be an unbiased estimator of $\overline{x}$, even though $\overline{x}$ is entirely the wrong quantity of interest.
:::
Is the unbiasedness of our hobbit sampling unique to this example? Thankfully no. We can prove that the sample mean will be unbiased for the population mean under a simple random sample. Relying on the definition of the sample mean, we can obtain:
$$
\E[\Xbar_{n}] = \E\left[\frac{1}{n} \sum_{i=1}^{N} Z_{i}x_{i}\right] = \frac{1}{n} \sum_{i=1}^{N} \E[Z_{i}]x_{i} = \frac{1}{n} \sum_{i=1}^{N} \frac{n}{N}x_{i} = \frac{1}{N} \sum_{i=1}^{N}x_{i} = \overline{x}
$$
Using $\E[Z_i] = n/N$ for the simple random sample in the third equality is key. Intuitively, the probability of being included in the sample is simply the fraction of the population being selected, $n/N$.
The second salient feature of an estimator's sampling distribution is its spread. Generally speaking, we prefer an estimator whose estimates are very similar from sample to sample over an estimator whose estimates vary wildly from one sample to the next. We quantify this spread with the **sampling variance**, which is simply the variance of the sampling distribution of the estimator, or
$$
\V[\Xbar_n] = \E[(\Xbar_n - \E[\Xbar_n])^2].
$$
An alternative measure of spread is the **standard error** of the estimator, which is the square root of the sampling variance,
$$
\se[\Xbar_n] = \sqrt{\V[\Xbar_n]}.
$$
The standard error is often more interpretable because it is on the same scale as the original variable. Using our hobbits’ heights example, the sampling variance would be measured in centimeters squared but the standard error would be measured in centimeters and, thus, easier to interpret.
The final important property is the **mean squared error** or **MSE**, which (as its name implies) measures the average of the squared error:
$$
\text{MSE} = \E[(\Xbar_n-\overline{x})^2].
$$
Keen-eyed readers might find this quantity redundant because, as we showed above, the sample mean is unbiased, so $\E[\Xbar_n] = \overline{x}$. This, in turn, means that the sampling variance of the sample mean is just the mean squared error. However, circumstances will often conspire to make us use biased estimators, so these two quantities will differ. In fact, if we have an estimator $\widehat{\theta}$ for some population quantity $\theta$,
$$
\begin{aligned}
\text{MSE}[\widehat{\theta}] &= \E[(\widehat{\theta} - \theta)^2] \\
&= \E[(\widehat{\theta} - \E[\widehat{\theta}] + \E[\widehat{\theta}] - \theta)^2] \\
&= \E[(\widehat{\theta} - \E[\widehat{\theta}])^2] + \left(\E[\widehat{\theta}] - \theta\right) ^ 2 + 2\E[(\widehat{\theta} - \E[\widehat{\theta}])]\left(\E[\widehat{\theta}] - \theta\right) \\
&= \text{bias}[\widehat{\theta}]^2 + \V[\widehat{\theta}]
\end{aligned}
$$
Thus, the MSE is low when bias and variance are low.
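To see the decomposition at work with a biased estimator, the sketch below uses the sample maximum as an estimator of the population mean. This is a deliberately poor estimator that we introduce only for illustration. It computes the bias, sampling variance, and MSE by enumeration over the six srs samples of hobbit heights.

```python
from fractions import Fraction
from itertools import combinations

heights = [124, 127, 123, 127]  # Frodo, Sam, Pip, Merry (cm)
n = 2
theta = Fraction(sum(heights), len(heights))  # estimand: population mean

# Hypothetical biased estimator: the sample maximum.
estimates = [Fraction(max(s)) for s in combinations(heights, n)]

def expectation(values):
    """Average over equally likely samples (the srs design)."""
    return sum(values) / len(values)

mean_est = expectation(estimates)
bias = mean_est - theta
variance = expectation([(e - mean_est) ** 2 for e in estimates])
mse = expectation([(e - theta) ** 2 for e in estimates])

print(float(bias), float(variance), float(mse))
```

The bias is 1.25 and the sampling variance is 1.25, so the decomposition gives $1.25^2 + 1.25 = 2.8125$, which agrees with the MSE computed directly.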
Note that connecting these concepts to notions of precision and accuracy is useful. In particular, estimators with low sampling variance are **precise**, whereas estimators with low MSE are **accurate**. An estimator can be very precise, but the same estimator can be inaccurate because it is biased.
## Question 5: Uncertainty
We now have a population, a quantity of interest, a sampling design, an estimator, and, with data, an actual estimate. But if we sampled, say, Sam and Merry from the hobbit population and obtained a sample mean height of 127 cm, a reasonable worry would be that different samples – for example, Sam and Frodo or Merry and Pip – would give us a different sample mean. So is the estimate of 127 cm that we get from our sample of Sam and Merry close to the true population mean? We cannot truly know without conducting a complete census of all four hobbits, which would render our sampling pointless. Can we instead figure out how far we might be from the truth – i.e., the true population mean? The sampling variance addresses this exact question, but the sampling variance depends on the sampling distribution, and we only have a single sample draw from this distribution, which gave the estimate of 127.
If we have a specific estimator and a sampling design, we can usually derive an analytical expression for the sampling variance (and, thus, the standard error), which in turn will identify the factors influencing the sampling variance. To aid in this endeavor, we need to define an additional feature of the population distribution, the **population variance**,
$$
s^{2}= \frac{1}{N-1} \sum_{i=1}^{N} (x_{i} - \overline{x})^{2}.
$$
The population variance measures the spread of the $x_i$ values in the population. As such it is a fixed quantity and not a random variable.
We now write the sampling variance of $\Xbar_n$ under simple random sampling as
$$
\V[\Xbar_{n}] = \left(1 - \frac{n}{N}\right) \frac{s^{2}}{n}
$$
Several features stand out from this expression. First, if the data $x_i$ are more spread out in the population, the sample mean will also be more spread out. Second, the larger the sample size, $n$, the smaller the sampling variance (for a fixed population size). Third, the larger the population size, $N$, the larger the sampling variance (for a fixed sample size and population variance), because the factor $1 - n/N$ grows toward 1 as the sample covers a smaller fraction of the population.
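As a sanity check on this expression, the sketch below computes the sampling variance of the hobbit sample mean in two ways: from the formula, and directly from the six equally likely samples. This is a numerical check on one small example, not a proof.

```python
from fractions import Fraction
from itertools import combinations

heights = [124, 127, 123, 127]  # Frodo, Sam, Pip, Merry (cm)
N, n = len(heights), 2

# Population variance s^2, with the N - 1 divisor used in the text.
pop_mean = Fraction(sum(heights), N)
pop_var = sum((x - pop_mean) ** 2 for x in heights) / (N - 1)

# Sampling variance from the formula (1 - n/N) * s^2 / n.
formula_var = (1 - Fraction(n, N)) * pop_var / n

# Sampling variance by enumeration over all srs samples.
sample_means = [Fraction(sum(s), n) for s in combinations(heights, n)]
center = sum(sample_means) / len(sample_means)
enumerated_var = sum((m - center) ** 2 for m in sample_means) / len(sample_means)

print(float(pop_var), float(formula_var), float(enumerated_var))
```

Here $s^2 = 4.25$, and both routes give a sampling variance of 1.0625 square centimeters, or a standard error of about 1.03 cm.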
### Deriving the sampling variance of the sample mean
How did we obtain this expression for the sampling variance under simple random sampling? It would be tempting to simply say "someone else proved it for me," but blind faith in statistical theory limits our own understanding of this situation and the ability to navigate novel scenarios that routinely arise in research.
To derive the sampling variance of the sample mean, let's begin with a simple application of the rules of variance that would be valid for any sampling design:
$$
\V[\Xbar_{n}] = \V\left[\frac{1}{n} \sum_{i=1}^N x_iZ_i\right] = \frac{1}{n^2}\left[\sum_{i=1}^N x_i^2\V[Z_i] + \sum_{i=1}^N\sum_{j\neq i} x_ix_j\cov[Z_i,Z_j]\right].
$$
Note in the second equality that the $x_i$ and $x_j$ values come out of the variance and covariance operators as if they are constants. This is because, in design-based inference, they are exactly constants. The only source of variation and uncertainty comes from the sampling, indicated by the inclusion indicators, $Z_i$. To make progress, we need to know the variance and covariance of these inclusion indicators. Recall that the variance of a binary indicator with probability $p$ of being 1 is $p(1 - p)$. So if $\P(Z_i = 1) = n/N$ for a simple random sample, then
$$
\V[Z_i] = \frac{n}{N}\left(1 - \frac{n}{N}\right) = \frac{n(N - n)}{N^2}.
$$
If you are used to the "independent and identically distributed" framework (to which we will turn in the next chapter), the covariances in the sampling variance might surprise you. Aren't units usually assumed to be independent? While this assumption would (and will) make our math lives easier, it is not true for the simple random sample. The srs samples units without replacement, which implies that units' inclusion into the sample is not independent: knowing that unit $i$ was included in the sample means that another unit $j$ has only a $(n-1)/(N-1)$ probability of being included in the sample. To derive an expression for the covariance, note that $\cov(Z_i, Z_j) = \E[Z_iZ_j] - \E[Z_i]\E[Z_j]$ and
$$
\E[Z_iZ_j] = \P(Z_i = 1, Z_j = 1) = \P(Z_i = 1)\P(Z_j =1 \mid Z_i = 1) = \frac{n}{N}\cdot \frac{n-1}{N-1}.
$$
Plugging this into our covariance statement, we get
$$
\begin{aligned}
\cov(Z_i, Z_j) &= \E[Z_iZ_j] - \E[Z_i]\E[Z_j] \\
&= \frac{n}{N}\cdot \frac{n-1}{N-1} - \frac{n^2}{N^2} \\
&=\frac{n}{N}\left(\frac{n-1}{N-1} - \frac{n}{N}\right) \\
&= \frac{n}{N}\left(\frac{Nn-N - Nn + n}{N(N-1)}\right) \\
&= -\frac{n(N- n)}{N^2(N-1)} \\
&= -\frac{\V[Z_i]}{N-1}.
\end{aligned}
$$
Given that variances and population sizes must be positive, the covariance between the inclusions of two units is negative. Going back to our hobbits, there are a fixed number of spots in the sample, and so Frodo being included lowers the chance that Sam is included, so we end up with this negative covariance.
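The moments of the inclusion indicators can also be verified by enumeration. The sketch below takes $N = 4$ and $n = 2$, treats units 0 and 1 as stand-ins for $i$ and $j$, and computes $\E[Z_i]$, $\V[Z_i]$, and $\cov(Z_i, Z_j)$ directly from the six equally likely samples. The helper `expect` is our own.

```python
from fractions import Fraction
from itertools import combinations

N, n = 4, 2
samples = list(combinations(range(N), n))
prob = Fraction(1, len(samples))  # srs: every sample equally likely

def expect(f):
    """Expectation of f(sample) over the srs design."""
    return sum(prob * f(set(s)) for s in samples)

e_zi = expect(lambda s: int(0 in s))               # E[Z_i]
e_zj = expect(lambda s: int(1 in s))               # E[Z_j]
e_zizj = expect(lambda s: int(0 in s and 1 in s))  # E[Z_i Z_j]

var_zi = e_zi - e_zi ** 2       # Z_i is binary, so Z_i^2 = Z_i
cov_zizj = e_zizj - e_zi * e_zj

print(e_zi, var_zi, cov_zizj)
```

The output is 1/2, 1/4, and -1/12, matching $n/N$, $n(N-n)/N^2$, and $-\V[Z_i]/(N-1)$ respectively.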
With the covariance and variance now calculated, we can derive the sampling variance of the sample mean:
$$
\begin{aligned}
\V[\Xbar_{n}] &= \frac{1}{n^2}\left[\sum_{i=1}^N x_i^2\V[Z_i] + \sum_{i=1}^N\sum_{j\neq i} x_ix_j\cov[Z_i,Z_j]\right] \\
&= \frac{1}{n^2}\left[\sum_{i=1}^N x_i^2\V[Z_i] - \frac{1}{N-1} \sum_{i=1}^N\sum_{j\neq i} x_ix_j\V[Z_i]\right] \\
&= \frac{\V[Z_i]}{n^2}\left[\sum_{i=1}^N x_i^2 - \frac{1}{N-1} \sum_{i=1}^N\sum_{j\neq i} x_ix_j\right] \\
&= \frac{N-n}{nN^2}\left[\sum_{i=1}^N x_i^2 - \frac{1}{N-1} \sum_{i=1}^N\sum_{j\neq i} x_ix_j\right]
\end{aligned}
$$
Where do we go from here? Unfortunately, we have arrived at the non-obvious and seemingly magical step of "adding and subtracting a crucial quantity." (One needs to know the step before completing the proof, so how could you complete the proof without knowing this step?) In this case, it is necessary to add and subtract the quantity $(N-1)^{-1} \sum_{i=1}^N x_i^2$. To see why, rewrite the population variance in a slightly different way:
$$
s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{x})^2 = \frac{1}{N-1} \left(\sum_{i=1}^N x_i^2 - N\overline{x}^2\right)
$$
Note that we can write
$$
N^2\overline{x}^2 = \sum_{i=1}^N x_i^2 + \sum_{i=1}^N \sum_{j\neq i} x_ix_j,
$$
which provides a hint as to the quantity that we will add and subtract:
$$
\begin{aligned}
\V[\Xbar_{n}] &= \frac{N-n}{nN^2}\left[\sum_{i=1}^N x_i^2 \textcolor{red!50}{\underbrace{+ \frac{1}{N-1}\sum_{i=1}^N x_i^2 - \frac{1}{N-1}\sum_{i=1}^N x_i^2}_{\text{add and subtract}}} - \frac{1}{N-1} \sum_{i=1}^N\sum_{j\neq i} x_ix_j\right] \\
&= \frac{N-n}{nN^2}\left[\frac{N}{N-1}\sum_{i=1}^N x_i^2 - \frac{1}{N-1} \sum_{i=1}^N\sum_{j=1}^N x_ix_j\right] \\
&= \frac{N-n}{nN^2}\left[\frac{N}{N-1}\sum_{i=1}^N x_i^2 - \frac{N^2}{N-1} \overline{x}^2\right] \\
&= \frac{N-n}{nN(N-1)}\left[\sum_{i=1}^N x_i^2 - N \overline{x}^2\right] \\
&= \frac{N-n}{nN(N-1)}\sum_{i=1}^N (x_i - \overline{x})^2 \\
&= \frac{(N-n)}{N}\frac{s^2}{n} = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}.
\end{aligned}
$$
This proof is rather involved but does display some commonly used approaches to deriving statistical results. It also highlights how the sampling scheme leads to dependence, making the result more complicated. The next chapter will discuss how the variance of the sample mean under independent and identically distributed sampling is much simpler.
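One way to build confidence in the final formula is to check it by brute force. The sketch below takes the four hobbit heights to be 124, 127, 123, and 127 cm, assumes samples of size $n = 2$, enumerates all possible samples, and compares the variance of the sample mean across those samples with $(1 - n/N)s^2/n$.

```{r}
# Assumed inputs: four hobbit heights in cm and samples of size n = 2
x <- c(124, 127, 123, 127)
N <- length(x)
n <- 2

# Every possible sample (one per column) is equally likely
samples <- combn(N, n)
sample_means <- apply(samples, 2, function(idx) mean(x[idx]))

# Variance of the sample mean across all possible samples
enumerated_var <- mean((sample_means - mean(sample_means))^2)

# Formula from the derivation; var() uses the N - 1 divisor, matching s^2
formula_var <- (1 - n / N) * var(x) / n

c(enumerated = enumerated_var, formula = formula_var)
```

The two numbers should agree up to floating-point error, since the enumeration computes the sampling variance directly from its definition.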
### Estimating the sampling variance
An unfortunate aspect of the sampling variance, $\V[\Xbar_n]$, is that it depends on the population variance, $s^2$, which we cannot know unless we have a census of the entire population. (If we had that information, we would not need to worry about uncertainty.) Thus, we need to estimate the sampling variance. Since we already know $n$, the sample size, and $N$, the population size, the most straightforward way to do this is to find an estimator for the population variance.
A good estimator for this is the **sample variance**, which is simply the variance formula applied to the sample itself,
$$
S^2 = \frac{1}{n-1} \sum_{i=1}^N Z_i(x_i - \Xbar_n)^2.
$$
We can obtain an estimator for the sampling variance by substituting this in for the population variance,
$$
\widehat{\V}[\Xbar_n] = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}.
$$
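In practice, this estimator is straightforward to compute from a single sample. The sketch below wraps it in a small helper; the function name `est_sampling_var` and the example sample are hypothetical and only meant to illustrate the formula.

```{r}
# Hypothetical helper (not from any package): estimated sampling variance
# of the sample mean under simple random sampling without replacement
est_sampling_var <- function(x_sample, N) {
  n <- length(x_sample)
  stopifnot(n >= 2, N >= n)
  # var() divides by n - 1, so it matches the sample variance S^2
  (1 - n / N) * var(x_sample) / n
}

# An assumed sample of two heights (cm) from a population of N = 4
x_sample <- c(124, 127)
v_hat <- est_sampling_var(x_sample, N = 4)

c(estimate = mean(x_sample), std_error = sqrt(v_hat))
```

Taking the square root of the estimated sampling variance gives an estimated standard error on the same scale as the data.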
::: {.callout-warning}
## Mind your variances
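It is easy to confuse the population variance, the sample variance, and the sampling variance (just as it is easy to confuse the population, the distribution of the sample, and the sampling distribution). All three are variances, but of very different distributions: the population variance, $s^2$, describes the spread of $x_i$ across the population; the sample variance, $S^2$, describes the spread of the values in the sample we happened to draw; and the sampling variance, $\V[\Xbar_n]$, describes how the sample mean varies across repeated samples.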
:::
Why is $\widehat{\V}[\Xbar_n]$ a “good” estimator for $\V[\Xbar_{n}]$? To answer this, we apply the same criteria as above in Question 4. Ideally, the estimator would be unbiased, meaning it does not systematically over- or underestimate how much variation is in the sample mean across repeated samples. To check this, we start with the expectation of the sample variance:
$$
\begin{aligned}
\E[S^2] &= \frac{1}{n-1} \sum_{i=1}^N \E[Z_i(x_i - \Xbar_n)^2] \\
&= \frac{1}{n-1} \E\left[\sum_{i=1}^N Z_i(x_i - \overline{x} - (\Xbar_n - \overline{x}))^2\right] \\
&= \frac{1}{n-1} \E\left[\sum_{i=1}^N \left(Z_i(x_i - \overline{x})^2 -2Z_i(x_i - \overline{x})(\Xbar_n - \overline{x}) + Z_i(\Xbar_n - \overline{x})^2\right)\right]
\end{aligned}
$$
Notice that $(\Xbar_n - \overline{x})$ does not depend on $i$ so we can pull it out of the summations:
$$
\begin{aligned}
\E[S^2] &= \frac{1}{n-1} \E\left[\sum_{i=1}^N Z_i(x_i - \overline{x})^2 -2(\Xbar_n - \overline{x}) \sum_{i=1}^N Z_i(x_i - \overline{x}) + (\Xbar_n - \overline{x})^2 \sum_{i=1}^N Z_i\right] \\
&= \frac{1}{n-1} \E\left[\sum_{i=1}^N Z_i(x_i - \overline{x})^2 -2n(\Xbar_n - \overline{x})^2 + n(\Xbar_n - \overline{x})^2\right] \\
&= \frac{1}{n-1} \left[\sum_{i=1}^N \E[Z_i] (x_i - \overline{x})^2 -n\E[(\Xbar_n - \overline{x})^2]\right] \\
&= \frac{n}{N}\frac{1}{n-1}\sum_{i=1}^N (x_i - \overline{x})^2 -\frac{n}{n-1}\V[\Xbar_n] \\
&= \frac{n(N-1)}{N(n-1)}s^2 -\frac{(N-n)}{N(n-1)}s^2 \\
&= s^2
\end{aligned}
$$
This shows that the sample variance is unbiased for the population variance. To complete the derivation, we can just plug this into the estimated sampling variance,
$$
\E\left[\widehat{\V}[\Xbar_n]\right] = \left(1 - \frac{n}{N}\right)\frac{\E\left[S^2\right]}{n} = \left(1 - \frac{n}{N}\right)\frac{s^2}{n} = \V[\Xbar_n],
$$
which establishes that the estimator is unbiased.
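Unbiasedness is a statement about the average over all possible samples, so for a small population we can verify it exactly. The sketch below again takes the hobbit heights to be 124, 127, 123, and 127 cm and assumes $n = 2$; it averages $S^2$ and $\widehat{\V}[\Xbar_n]$ over every possible sample.

```{r}
# Assumed inputs: four hobbit heights in cm and samples of size n = 2
x <- c(124, 127, 123, 127)
N <- length(x)
n <- 2

# Sample variance S^2 in each of the equally likely samples
samples <- combn(N, n)
S2 <- apply(samples, 2, function(idx) var(x[idx]))

# Estimated sampling variance in each sample
v_hat <- (1 - n / N) * S2 / n

# Expectations over the design versus their targets
rbind(
  sample_variance   = c(expectation = mean(S2), target = var(x)),
  sampling_variance = c(expectation = mean(v_hat),
                        target = (1 - n / N) * var(x) / n)
)
```

Individual samples can give estimates far from the target (a sample of two hobbits with the same height has $S^2 = 0$), but the average across samples matches it.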
## Stratified sampling and survey weights
True to its name, the simple random sample is perhaps the most straightforward way to take a random sample of a fixed size. With more information about the population, however, we might obtain better estimates of the population quantities by incorporating this information into the sampling scheme. We can do this by conducting a **stratified random sample**, where we divide up the population into several strata (or groups) and conduct simple random samples within each stratum. We create these strata (or "stratify the population" in the usual jargon) based on the additional information about the population.
Consider an expanded population of the entire Fellowship of the Ring, which included nine adventurous members: the four hobbits plus two humans (Aragorn and the doomed Boromir), an elf (Legolas of the Woodland Elves), a dwarf (Gimli), and a wizard (Gandalf the Grey).
| Unit ($i$) | Race | Height in cm ($x_i$) |
|:------------|:-------|:---------------------|
| 1 (Frodo) | Hobbit | 124 |
| 2 (Sam) | Hobbit | 127 |
| 3 (Pip) | Hobbit | 123 |
| 4 (Merry) | Hobbit | 127 |
| 5 (Gimli) | Dwarf | 137 |
| 6 (Gandalf) | Wizard | 168 |
| 7 (Aragorn) | Human | 198 |
| 8 (Boromir) | Human | 193 |
| 9 (Legolas) | Elf | 183 |
If we were taking a sample of size 5 from this population, we could use a simple random sample, but note that the sample could be lopsided. We could, for instance, sample mostly or all non-hobbits. We could instead conduct stratified sampling here by splitting our population into two strata: hobbits and non-hobbits, making up 4/9ths $\approx$ 44% and 5/9ths $\approx$ 56% of the population, respectively. To get to a sample of 5, we could take simple random samples of size 2 for the hobbits and size 3 for the non-hobbits. This would guarantee our sample would be 40% hobbit every time while still maintaining randomness in our selection of which hobbits and non-hobbits go into the sample.
Another reason to conduct a stratified random sample is to guarantee a level of precision for a certain subgroup of the population. Social science researchers often conduct nationally representative surveys but have a specific interest in obtaining estimates for certain minority populations – for example, African Americans, Latinos, people who are LGBTQ+, and others. In modest sample sizes, the number of respondents in one of these groups might be too small to learn much about their opinions. Sampling a higher proportion of the group of interest will help ensure that we can make precise statements about that group.
We can describe a sampling design by its inclusion probabilities, $\pi_i = \P(Z_{i} = 1)$. In a simple random sample, we have $\pi_i = n/N$ for all $i$. By contrast, stratified random sampling is an example of a broad class of sampling methods in which the inclusion probabilities can differ across units. In the Fellowship of the Ring example, we sample 2 of the 4 hobbits and 3 of the 5 non-hobbits, so we have the following inclusion probabilities:
| Unit ($i$) | Race | Inclusion probability ($\pi_i$) |
|:------------|:-------|:--------------------------------|
| 1 (Frodo) | Hobbit | 0.5 |
| 2 (Sam) | Hobbit | 0.5 |
| 3 (Pip) | Hobbit | 0.5 |
| 4 (Merry) | Hobbit | 0.5 |
| 5 (Gimli) | Dwarf | 0.6 |
| 6 (Gandalf) | Wizard | 0.6 |
| 7 (Aragorn) | Human | 0.6 |
| 8 (Boromir) | Human | 0.6 |
| 9 (Legolas) | Elf | 0.6 |
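The inclusion probabilities in this table are just the stratum sample size divided by the stratum population size. A minimal sketch of that calculation follows; the data frame and column names (`fellowship`, `stratum`, `pi_i`) are our own choices for illustration.

```{r}
fellowship <- data.frame(
  name = c("Frodo", "Sam", "Pip", "Merry", "Gimli",
           "Gandalf", "Aragorn", "Boromir", "Legolas"),
  stratum = c(rep("hobbit", 4), rep("non-hobbit", 5)),
  stringsAsFactors = FALSE
)

# Sample sizes per stratum, as in the text: 2 hobbits and 3 non-hobbits
n_h <- c("hobbit" = 2, "non-hobbit" = 3)

# Stratum population sizes counted from the data
N_h <- lengths(split(fellowship$name, fellowship$stratum))

# Inclusion probability of each unit: n_h / N_h for its stratum
fellowship$pi_i <- unname(n_h[fellowship$stratum] / N_h[fellowship$stratum])
fellowship
```

A useful sanity check is that the inclusion probabilities sum to the total sample size, here $4 \times 0.5 + 5 \times 0.6 = 5$.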
There are additional ways to conduct a random sample with unequal inclusion probabilities. For example, suppose the goal is to randomly sample 5 U.S. cities for study. We might want to bias our sample toward larger cities in order to capture a larger number of citizens in the overall sample. If the number of inhabitants of city $i$ is $b_i$, then the probability of selecting city $i$ on each draw when sampling with replacement[^pps] is
$$
\pi_i = \frac{b_i}{\sum_{j=1}^N b_j}.
$$
As with stratification, this design uses information about the population, but here the information (city size) is continuous, whereas the information used to form strata is discrete.
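To make the size-based design concrete, the sketch below draws 5 cities with replacement using base R's `sample()` with its `prob` argument. The city names and populations are entirely hypothetical, and we treat $b_i / \sum_j b_j$ as the probability of selecting city $i$ on each draw.

```{r}
# Hypothetical city sizes in thousands of inhabitants (made up for illustration)
b <- c(8000, 4000, 2500, 1500, 600, 400)
names(b) <- paste0("city_", seq_along(b))
n <- 5

# Probability of selecting each city on a single draw
p_draw <- b / sum(b)

# Probability that a city appears at least once in n draws with replacement
pi_incl <- 1 - (1 - p_draw)^n

# One realized sample; the same city can be drawn more than once
set.seed(1)
drawn <- sample(names(b), size = n, replace = TRUE, prob = p_draw)

round(cbind(p_draw, pi_incl), 3)
drawn
```

With replacement, the chance that a city appears at least once in the sample is larger than its per-draw probability, which is one reason the distinction between the two matters when constructing weights.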
[^pps]: This description is true for sampling with replacement. When sampling without replacement, we would need to adjust the probabilities to account for how being selected first means that a unit cannot be selected second.
Using a sampling design with unequal inclusion probabilities means that we have changed our sampling design (question 3), but the population and estimands (questions 1 and 2) remain the same. We are still interested in estimating the population mean, $\overline{x}$. We now turn to the estimator (question 4), since we will need to use a new estimator that matches the design.
Two estimators are commonly used to estimate the population mean when sampling with unequal inclusion probabilities. The first, the **Horvitz-Thompson (HT) estimator**, has the form
$$
\widetilde{X}_{HT} = \frac{1}{N} \sum_{i=1}^{N} \frac{Z_{i}x_{i}}{\pi_{i}}.
$$
This estimator sums the values of the units in the sample, each weighted by the inverse of its inclusion probability, and divides by the population size. This is why the estimator is sometimes called the inverse probability weighting, or IPW, estimator.
We can show that the HT estimator is unbiased for the population mean by noting that $\E[Z_i] = \P(Z_i = 1) = \pi_i$, so that
$$
\E[\widetilde{X}_{HT}] = \frac{1}{N} \sum_{i=1}^N \frac{\E[Z_i]x_i}{\pi_i} = \frac{1}{N} \sum_{i=1}^N x_i = \overline{x}.
$$
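We can confirm this result exactly for the Fellowship example. The sketch below enumerates all 60 possible stratified samples (2 of the 4 hobbits and 3 of the 5 non-hobbits), computes the HT estimate in each, and averages them. The variable names are our own.

```{r}
# Heights (cm) and inclusion probabilities from the Fellowship tables
x <- c(124, 127, 123, 127, 137, 168, 198, 193, 183)
pi_i <- c(rep(2 / 4, 4), rep(3 / 5, 5))
N <- length(x)

# All possible draws within each stratum (one draw per column)
hob <- combn(1:4, 2)   # 6 ways to pick 2 hobbits
non <- combn(5:9, 3)   # 10 ways to pick 3 non-hobbits

# Every pairing of a hobbit draw and a non-hobbit draw is equally likely
pairs <- expand.grid(h = seq_len(ncol(hob)), k = seq_len(ncol(non)))

# Horvitz-Thompson estimate for each possible sample
ht <- mapply(function(h, k) {
  idx <- c(hob[, h], non[, k])
  sum(x[idx] / pi_i[idx]) / N
}, pairs$h, pairs$k)

c(mean_of_HT = mean(ht), population_mean = mean(x))
```

The average of the HT estimates over all possible samples should equal the population mean, which is what unbiasedness means in the design-based setting.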
A downside of the HT estimator is that it can be unstable if a unit with a very small inclusion probability is selected since that unit’s weight ($1/\pi_i$) will be very large. This instability is the cost of being unbiased under designs with very unequal inclusion probabilities. The formula for the sampling variance of the HT estimator is also rather complicated and requires notation that is less important to the task at hand.
The second estimator for the population mean when sampling with unequal inclusion probabilities is the **Hájek estimator**, which normalizes the weights so they sum to $N$ and has the form
$$
\widetilde{X}_{hj} = \frac{\sum_{i=1}^N Z_{i}x_{i} / \pi_{i}}{\sum_{i=1}^{N} Z_{i}/\pi_{i}}.
$$
This estimator is **biased** for the population mean since there is a random quantity in the denominator. The Hájek estimator is nonetheless preferred in many situations because it often has lower sampling variance than the HT estimator.
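To see this tradeoff, we need a design in which the sum of the weights varies from sample to sample. The sketch below therefore assumes a different design from the stratified one: each member of the Fellowship enters the sample independently with probability $\pi_i$, so the sample size is random. (With fixed sample sizes in each stratum, the sampled weights always sum to $N$ and the two estimators coincide.) We enumerate all $2^9$ possible samples and, because the Hájek estimator is undefined for an empty sample, evaluate it conditional on a non-empty sample.

```{r}
x <- c(124, 127, 123, 127, 137, 168, 198, 193, 183)
pi_i <- c(rep(0.5, 4), rep(0.6, 5))
N <- length(x)

# Assumed design for illustration only: independent inclusion of each unit
Z <- as.matrix(expand.grid(rep(list(0:1), N)))  # all 2^N inclusion vectors
p <- apply(Z, 1, function(z) prod(ifelse(z == 1, pi_i, 1 - pi_i)))

num <- as.vector(Z %*% (x / pi_i))  # sum of Z_i x_i / pi_i
den <- as.vector(Z %*% (1 / pi_i))  # sum of Z_i / pi_i

ht <- num / N
keep <- den > 0                     # drop the empty sample for Hajek
hajek <- num[keep] / den[keep]
p_keep <- p[keep] / sum(p[keep])    # probabilities given a non-empty sample

xbar <- mean(x)
results <- rbind(
  HT    = c(bias = sum(p * ht) - xbar,
            mse  = sum(p * (ht - xbar)^2)),
  Hajek = c(bias = sum(p_keep * hajek) - xbar,
            mse  = sum(p_keep * (hajek - xbar)^2))
)
results
```

Under this assumed design, the HT estimator has zero bias but a large mean squared error, because its value swings with the realized sample size. The Hájek estimator divides by the realized sum of weights, which removes most of that variation at the cost of a small bias.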
### Sampling weights
The HT and Hájek estimators are both functions of what are commonly called the **sampling weights**,
$$
w_i = \frac{1}{\pi_i}.
$$
We can write the HT estimator as
$$
\widetilde{X}_{HT} = \frac{1}{N} \sum_{i=1}^N w_iZ_ix_i,
$$
and we can write the Hájek estimator as
$$
\widetilde{X}_{hj} = \frac{\sum_{i=1}^N w_iZ_{i}x_{i}}{\sum_{i=1}^{N} w_iZ_{i}}.
$$
These weights, $w_i$, are usually included in final survey data sets because they contain all the information about the sampling design a researcher needs to analyze the survey responses even without knowledge of the exact design.[^var]
[^var]: If we want design-based estimators of the sampling variance, we would also need to know the joint inclusion probabilities, which are the probabilities of any two units being sampled together.
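The sampling weights have a nice interpretation in terms of a pseudo-population: each unit in the sample "represents" $w_i = 1/\pi_i$ units in the population. In the Fellowship example, each sampled hobbit represents $1/0.5 = 2$ hobbits and each sampled non-hobbit represents $1/0.6 \approx 1.67$ non-hobbits. Weighting the sample in this way makes it resemble the population even when some groups are sampled at higher rates than others.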
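The pseudo-population idea can be checked with a little arithmetic. The sketch below uses the inclusion probabilities from the Fellowship example and one assumed sample (Frodo, Sam, Gimli, Gandalf, and Aragorn) to show how many population units the sample "stands in" for.

```{r}
# Inclusion probabilities for the stratified Fellowship design
pi_i <- c(rep(0.5, 4), rep(0.6, 5))
w <- 1 / pi_i

# One assumed sample: Frodo, Sam, Gimli, Gandalf, Aragorn
sampled <- c(1, 2, 5, 6, 7)

# Number of population units each sampled unit represents
w[sampled]

# Size of the pseudo-population implied by the sample
sum(w[sampled])
```

In this design the weights in any sample add up to the population size of 9: two sampled hobbits represent the 4 hobbits, and three sampled non-hobbits represent the 5 non-hobbits.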
Finally, note that statistical software can be confusing in how it handles weights. For example, it may not be obvious which estimator the R function `weighted.mean(x, w)` computes. In fact, the source code basically calls
```{r}
#| eval: false
sum(x * w) / sum(w)
```
which is equivalent to the Hájek estimator above when `x` and `w` contain the values and sampling weights of the sampled units.
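A short check makes the correspondence explicit. The sketch below computes both estimators by hand for one assumed stratified sample from the Fellowship (Pip, Merry, Gandalf, Boromir, and Legolas) and compares them with `weighted.mean()`.

```{r}
x <- c(124, 127, 123, 127, 137, 168, 198, 193, 183)
pi_i <- c(rep(0.5, 4), rep(0.6, 5))
N <- length(x)

# One assumed sample: Pip, Merry, Gandalf, Boromir, Legolas
sampled <- c(3, 4, 6, 8, 9)
x_s <- x[sampled]
w_s <- 1 / pi_i[sampled]

hajek <- sum(w_s * x_s) / sum(w_s)  # normalizes by the sum of the weights
ht <- sum(w_s * x_s) / N            # normalizes by the population size

c(weighted_mean = weighted.mean(x_s, w_s), hajek = hajek, ht = ht)
```

Here all three numbers agree, but the agreement with HT is specific to this design, where the sampled weights always sum to $N$. In designs where the sum of the weights varies across samples, `weighted.mean()` would still match the Hájek estimator but not the HT estimator.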
## Summary
This chapter covered the basic structure of design-based inference in the context of sampling from a population. We introduced the basic questions of statistical inference, including specifying a population and quantity of interest, choosing a sampling design and estimator, and assessing the uncertainty of the estimator. Of course, we have only scratched the surface of the types of designs and estimators used in the practice of sampling. Professional probability surveys often use clustering, which means randomly selecting larger clusters of units and then randomly sampling within these clusters. However complex the sampling design, the core steps of design-based statistical inference remain the same. A researcher must identify a population, choose a quantity of interest, determine a sampling design, select an estimator, and describe the uncertainty of any estimates.