% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
\PassOptionsToPackage{dvipsnames,svgnames,x11names}{xcolor}
%
\documentclass[
13pt,
letterpaper,
DIV=11,
numbers=noendperiod]{scrreprt}
\usepackage{amsmath,amssymb}
\usepackage{iftex}
\ifPDFTeX
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
\usepackage{unicode-math}
\defaultfontfeatures{Scale=MatchLowercase}
\defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
\usepackage[]{cochineal}
\ifPDFTeX\else
% xetex/luatex font selection
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
\KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\setlength{\emergencystretch}{3em} % prevent overfull lines
\setcounter{secnumdepth}{5}
% Make \paragraph and \subparagraph free-standing
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{241,243,245}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.40,0.45,0.13}{#1}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\BuiltInTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\ExtensionTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.28,0.35,0.67}{#1}}
\newcommand{\ImportTok}[1]{\textcolor[rgb]{0.00,0.46,0.62}{#1}}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\NormalTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\RegionMarkerTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.07,0.07,0.07}{#1}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}\usepackage{longtable,booktabs,array}
\usepackage{calc} % for calculating minipage widths
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
\usepackage{graphicx}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
\newlength{\cslhangindent}
\setlength{\cslhangindent}{1.5em}
\newlength{\csllabelwidth}
\setlength{\csllabelwidth}{3em}
\newlength{\cslentryspacingunit} % times entry-spacing
\setlength{\cslentryspacingunit}{\parskip}
\newenvironment{CSLReferences}[2] % #1 hanging-ident, #2 entry spacing
{% don't indent paragraphs
\setlength{\parindent}{0pt}
% turn on hanging indent if param 1 is 1
\ifodd #1
\let\oldpar\par
\def\par{\hangindent=\cslhangindent\oldpar}
\fi
% set entry spacing
\setlength{\parskip}{#2\cslentryspacingunit}
}%
{}
\usepackage{calc}
\newcommand{\CSLBlock}[1]{#1\hfill\break}
\newcommand{\CSLLeftMargin}[1]{\parbox[t]{\csllabelwidth}{#1}}
\newcommand{\CSLRightInline}[1]{\parbox[t]{\linewidth - \csllabelwidth}{#1}\break}
\newcommand{\CSLIndent}[1]{\hspace{\cslhangindent}#1}
\newcommand{\bs}{\symbf}
\newcommand{\mb}{\symbf}
\newcommand{\E}{\mathbb{E}}
\newcommand{\V}{\mathbb{V}}
\newcommand{\var}{\text{var}}
\newcommand{\cov}{\text{cov}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\Bern}{\text{Bern}}
\newcommand{\Bin}{\text{Bin}}
\newcommand{\Pois}{\text{Pois}}
\newcommand{\Unif}{\text{Unif}}
\newcommand{\se}{\textsf{se}}
\newcommand{\au}{\underline{a}}
\newcommand{\du}{\underline{d}}
\newcommand{\Au}{\underline{A}}
\newcommand{\Du}{\underline{D}}
\newcommand{\xu}{\underline{x}}
\newcommand{\Xu}{\underline{X}}
\newcommand{\Yu}{\underline{Y}}
\renewcommand{\P}{\mathbb{P}}
\newcommand{\U}{\mb{U}}
\newcommand{\Xbar}{\overline{X}}
\newcommand{\Ybar}{\overline{Y}}
\newcommand{\real}{\mathbb{R}}
\newcommand{\bbL}{\mathbb{L}}
\renewcommand{\u}{\mb{u}}
\renewcommand{\v}{\mb{v}}
\newcommand{\M}{\mb{M}}
\newcommand{\X}{\mb{X}}
\newcommand{\Xmat}{\mathbb{X}}
\newcommand{\bfx}{\mb{x}}
\newcommand{\y}{\mb{y}}
\newcommand{\bfbeta}{\mb{\beta}}
\renewcommand{\b}{\symbf{\beta}}
\newcommand{\e}{\bs{\epsilon}}
\newcommand{\bhat}{\widehat{\mb{\beta}}}
\newcommand{\XX}{\Xmat'\Xmat}
\newcommand{\XXinv}{\left(\XX\right)^{-1}}
\newcommand{\hatsig}{\hat{\sigma}^2}
\newcommand{\red}[1]{\textcolor{red!60}{#1}}
\newcommand{\indianred}[1]{\textcolor{indianred}{#1}}
\newcommand{\blue}[1]{\textcolor{blue!60}{#1}}
\newcommand{\dblue}[1]{\textcolor{dodgerblue}{#1}}
\newcommand{\indep}{\perp\!\!\!\perp}
\newcommand{\inprob}{\overset{p}{\to}}
\newcommand{\indist}{\overset{d}{\to}}
\newcommand{\eframe}{\end{frame}}
\newcommand{\bframe}{\begin{frame}}
\newcommand{\R}{\textsf{\textbf{R}}}
\newcommand{\Rst}{\textsf{\textbf{RStudio}}}
\newcommand{\rfun}[1]{\texttt{\color{magenta}{#1}}}
\newcommand{\rpack}[1]{\textbf{#1}}
\newcommand{\rexpr}[1]{\texttt{\color{magenta}{#1}}}
\newcommand{\filename}[1]{\texttt{\color{blue}{#1}}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\KOMAoption{captions}{tableheading}
\makeatletter
\@ifpackageloaded{tcolorbox}{}{\usepackage[skins,breakable]{tcolorbox}}
\@ifpackageloaded{fontawesome5}{}{\usepackage{fontawesome5}}
\definecolor{quarto-callout-color}{HTML}{909090}
\definecolor{quarto-callout-note-color}{HTML}{0758E5}
\definecolor{quarto-callout-important-color}{HTML}{CC1914}
\definecolor{quarto-callout-warning-color}{HTML}{EB9113}
\definecolor{quarto-callout-tip-color}{HTML}{00A047}
\definecolor{quarto-callout-caution-color}{HTML}{FC5300}
\definecolor{quarto-callout-color-frame}{HTML}{acacac}
\definecolor{quarto-callout-note-color-frame}{HTML}{4582ec}
\definecolor{quarto-callout-important-color-frame}{HTML}{d9534f}
\definecolor{quarto-callout-warning-color-frame}{HTML}{f0ad4e}
\definecolor{quarto-callout-tip-color-frame}{HTML}{02b875}
\definecolor{quarto-callout-caution-color-frame}{HTML}{fd7e14}
\makeatother
\makeatletter
\makeatother
\makeatletter
\@ifpackageloaded{bookmark}{}{\usepackage{bookmark}}
\makeatother
\makeatletter
\@ifpackageloaded{caption}{}{\usepackage{caption}}
\AtBeginDocument{%
\ifdefined\contentsname
\renewcommand*\contentsname{Table of contents}
\else
\newcommand\contentsname{Table of contents}
\fi
\ifdefined\listfigurename
\renewcommand*\listfigurename{List of Figures}
\else
\newcommand\listfigurename{List of Figures}
\fi
\ifdefined\listtablename
\renewcommand*\listtablename{List of Tables}
\else
\newcommand\listtablename{List of Tables}
\fi
\ifdefined\figurename
\renewcommand*\figurename{Figure}
\else
\newcommand\figurename{Figure}
\fi
\ifdefined\tablename
\renewcommand*\tablename{Table}
\else
\newcommand\tablename{Table}
\fi
}
\@ifpackageloaded{float}{}{\usepackage{float}}
\floatstyle{ruled}
\@ifundefined{c@chapter}{\newfloat{codelisting}{h}{lop}}{\newfloat{codelisting}{h}{lop}[chapter]}
\floatname{codelisting}{Listing}
\newcommand*\listoflistings{\listof{codelisting}{List of Listings}}
\usepackage{amsthm}
\theoremstyle{definition}
\newtheorem{example}{Example}[chapter]
\theoremstyle{definition}
\newtheorem{definition}{Definition}[chapter]
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[chapter]
\theoremstyle{remark}
\AtBeginDocument{\renewcommand*{\proofname}{Proof}}
\newtheorem*{remark}{Remark}
\newtheorem*{solution}{Solution}
\makeatother
\makeatletter
\@ifpackageloaded{caption}{}{\usepackage{caption}}
\@ifpackageloaded{subcaption}{}{\usepackage{subcaption}}
\makeatother
\makeatletter
\@ifpackageloaded{tcolorbox}{}{\usepackage[skins,breakable]{tcolorbox}}
\makeatother
\makeatletter
\@ifundefined{shadecolor}{\definecolor{shadecolor}{rgb}{.97, .97, .97}}
\makeatother
\makeatletter
\makeatother
\makeatletter
\makeatother
\ifLuaTeX
\usepackage{selnolig} % disable illegal ligatures
\fi
\IfFileExists{bookmark.sty}{\usepackage{bookmark}}{\usepackage{hyperref}}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\urlstyle{same} % disable monospaced font for URLs
\hypersetup{
pdftitle={A User's Guide to Statistical Inference and Regression},
pdfauthor={Matthew Blackwell},
colorlinks=true,
linkcolor={blue},
filecolor={Maroon},
citecolor={Blue},
urlcolor={Blue},
pdfcreator={LaTeX via pandoc}}
\title{A User's Guide to Statistical Inference and Regression}
\author{Matthew Blackwell}
\date{}
\begin{document}
\maketitle
\ifdefined\Shaded\renewenvironment{Shaded}{\begin{tcolorbox}[interior hidden, borderline west={3pt}{0pt}{shadecolor}, sharp corners, enhanced, frame hidden, breakable, boxrule=0pt]}{\end{tcolorbox}}\fi
\renewcommand*\contentsname{Table of contents}
{
\hypersetup{linkcolor=}
\setcounter{tocdepth}{2}
\tableofcontents
}
\bookmarksetup{startatroot}
\hypertarget{preface}{%
\chapter*{Preface}\label{preface}}
\addcontentsline{toc}{chapter}{Preface}
\markboth{Preface}{Preface}
\begin{figure}[th]
{\centering \includegraphics{assets/img/linear-approximation.png}
}
\end{figure}
This book, like many before it, will try to teach you statistics. The
field of statistics describes how we learn about the world using
quantitative data. In the social sciences, an increasing share of
empirical studies use statistical methods to provide evidence for or
against conceptual arguments. And, while it is possible to conduct
quantitative research without understanding statistics at an intuitive
level, it is not a good idea. Quantitative research involves a host of
\emph{choices} about the model to use, variables to include, tuning
parameters to set, assumptions to make, and so on. Without a deep
understanding of statistics, you may find these choices bewildering and
confusing, and you may simply (and possibly erroneously) yield to the
default settings of your statistical software.
The goal of this book is to give you the foundation to make
methodological choices for your specific application with knowledge and
with confidence. The material is intended for first-year PhD students in
political science, but it may be of interest more broadly.
We will focus on two key goals:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\item
\textbf{Understand the basic ways to assess estimators.} With
quantitative data, we often want to make statistical inferences about
some unknown feature of the world. We use estimators (which are just
ways of summarizing our data) to estimate these features. This book
will introduce the basics of this task at a general enough level to be
applicable to almost any estimator that you are likely to encounter in
empirical research in the social sciences. We will also cover major
concepts such as bias, sampling variance, consistency, and asymptotic
normality, which are so common to such a large swath of (frequentist)
inference that understanding them at a deep level will yield an
enormous return on your time investment. Once you understand these
core ideas, you will have a language to analyze any fancy new
estimator that pops up in the next few decades.
\item
\textbf{Apply these ideas to the estimation of regression models.} This
book will apply these ideas to one particular social science
workhorse: regression. Many methods either use regression estimators
like ordinary least squares or extend them in some way. Understanding
how these estimators work is vital for conducting research, for
reading and reviewing contemporary scholarship, and, frankly, for
being a good and valuable colleague in seminars and workshops.
Regression and regression estimators also provide an entry point for
discussing parametric models as approximations, rather than as rigid
assumptions about the truth of a given specification.
\end{enumerate}
Why write a book on statistics and regression when so many already
exist? While some texts at this level exist in the fields of statistics
and economics, they tend to focus on applications and models less
relevant to other social sciences. This book attempts to correct this.
The book also seeks to introduce a fairly high level of mathematical
sophistication that will challenge and push you to develop stronger
foundations in the material.
\hypertarget{roadmap}{%
\section*{Roadmap}\label{roadmap}}
\addcontentsline{toc}{section}{Roadmap}
\markright{Roadmap}
This book has two major parts. Part I introduces the basics of
statistical inference.
We start in Chapter~\ref{sec-design-based} by demonstrating basic
concepts of estimation and inference from the design-based perspective
in which we sample from a fixed, finite population, and all uncertainty
comes from randomness over who is and is not included in the sample.
This framework for inference has deep roots in the statistical
literature and provides a great deal of intuition for how estimation and
uncertainty work in simple settings. We will discuss how to use
design-based inference to estimate features of the population from
samples when the analyst knows the exact sampling design. Unfortunately,
researchers often lack this knowledge about how their data came to be,
limiting the usefulness of this approach.
Chapter~\ref{sec-model-based} introduces a more flexible approach to
estimation: model-based inference. With this approach, the researcher
posits a probability model for how the data came to be. This book
focuses on models that posit ``independent and identically distributed''
data for this model. The chapter describes how estimation and inference
proceed under these models and also introduces a broad class of
estimators based on the plug-in principle.
These two chapters focus on finite sample properties of different
estimation techniques, but we can say more about an estimator if we
consider how it behaves on larger and larger samples.
Chapter~\ref{sec-asymptotics} introduces this type of asymptotic
analysis. It covers the core results of asymptotic theory, such as the
law of large numbers, the central limit theorem, and the delta method,
but also shows why these results are important for statistical
inference. In particular, the chapter shows how these results enable the
creation of asymptotically valid confidence intervals.
Chapter~\ref{sec-hypothesis-tests} wraps up Part I of the book by
introducing statistical inference with hypothesis testing. This chapter
shows how to build hypothesis tests and provides intuition for all their
aspects. We also cover power analyses for planning studies and the
connection between confidence intervals and hypothesis tests.
Part II of the book focuses on one particular estimator of great
importance to quantitative social sciences: the least squares estimator.
Chapter~\ref{sec-regression} begins by describing exactly what quantity
of interest we are targeting when we discuss ``linear models.'' In
particular, we discuss how a population best linear predictor exists
even if the relationship between two variables is nonlinear. This
provides a coherent basis for linear regression estimation as a linear
approximation to a potentially nonlinear function. The chapter also
shows how to interpret the coefficients in these linear regression
models.
Chapter~\ref{sec-ols-mechanics} introduces the more mechanical
properties of the least squares estimator: how the estimator is
constructed, its geometrical interpretation, and how influential
observations may affect the estimates it returns. This chapter
introduces the least squares estimator in matrix form and provides key
intuition for understanding this compact notation.
Finally, Chapter~\ref{sec-ols-statistics} describes the statistical
properties of the least squares estimator. The chapter shows how
modeling assumptions affect the kinds of properties we can obtain. The
weakest modeling assumptions allow us to derive the surprisingly strong
asymptotic properties of least squares that we depend on in most
settings. The chapter then shows how stronger assumptions such as
linearity and normally distributed errors can provide even stronger
results but that they do so at the expense of potential model
misspecification.
\hypertarget{acknowledgements}{%
\section*{Acknowledgements}\label{acknowledgements}}
\addcontentsline{toc}{section}{Acknowledgements}
\markright{Acknowledgements}
Much of how I approach this material comes from Adam Glynn, for whom I
was a teaching fellow during graduate school. Thanks to the students of
Gov 2000 and Gov 2002 over the years for helping me refine the material in
this book. Very special thanks also go to those who have provided valuable
feedback, including Zeki Akyol, Noah Dasanaike, Maya Sen, and Jarell
Cheong Tze Wen.
\hypertarget{colophon}{%
\section*{Colophon}\label{colophon}}
\addcontentsline{toc}{section}{Colophon}
\markright{Colophon}
You can find the source for this book at
\url{https://github.com/mattblackwell/gov2002-book}. Any typos or errors
can be reported at
\url{https://github.com/mattblackwell/gov2002-book/issues}. Thanks for
reading.
This is a Quarto book. To learn more about Quarto books visit
\url{https://quarto.org/docs/books}.
\(\,\) \(\,\)
\part{Statistical Inference}
\hypertarget{sec-design-based}{%
\chapter{Design-based Inference}\label{sec-design-based}}
\hypertarget{introduction}{%
\section{Introduction}\label{introduction}}
Quantitative analysis of social data has an alluring exactness to it. It
allows us to estimate the average number of minutes of YouTube videos
watched to the millisecond, and in doing so it gives us the aura of true
scientists. But the advantage of quantitative analyses lies not in the
ability to derive precise three-decimal point estimates; rather,
quantitative methods shine because they allow us to communicate
methodological goals, assumptions, and results in a (hopefully) common,
compact, and precise mathematical language. It is this language that
helps clarify \emph{exactly} what researchers are doing with their data
and why.
This dewy view of quantitative methods is unfortunately often at odds
with how these methods are used in the real world. All too often we as
researchers find some arbitrary data, apply a statistical tool with
which we are familiar, and then shoehorn the results into a theoretical
story that may or may not have a (tenuous) connection. Quantitative
methods applied this way will provide us with a very specific answer to
a murky question about a shapeless target.
This book is a guide to a better foundation for quantitative analysis
and, in particular, for statistical inference. Inference is the task of
using the data we have to learn something about the data we do not have.
The organizing motto of this book is to help us as researchers be
\begin{quote}
Precise in stating our goals, transparent in stating our assumptions,
and honest in evaluating our results.
\end{quote}
These goals are the target of our inference -- that is, what we want to
learn and about whom.
In pursuing these goals, this book will focus on a general workflow for
statistical inference. The workflow boils down to answering a series of
questions about the goals, assumptions, and methods of our analysis:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
\textbf{Population}: who or what do we want to learn about?
\item
\textbf{Design/model}: how will we collect the data, or, what
assumptions are we making about how the data came to be?
\item
\textbf{Quantity of Interest}: what do we want to learn about the
population?
\item
\textbf{Estimator}: how will we use the data to produce an estimate?
\item
\textbf{Uncertainty}: how will we estimate and convey the error
associated with the estimate?
\end{enumerate}
These questions form the core of any quantitative research endeavor. And
the answers to these will draw on a mixture of substantive interests,
feasibility, and statistical theory, and this mixture will vary from
question to question. For example, the population of interest can vary
greatly from study to study, whereas many disparate studies may employ
the same estimand and estimator.
The second core question is particularly important, since it highlights
an essential division in how researchers approach statistical inference
-- specifically, \textbf{design-based inference} vs \textbf{model-based
inference}. Design-based inference typically focuses on situations in
which we have precise knowledge of how our sample was randomly selected
from the population. Uncertainty here comes exclusively from the random
nature of which observations are included in the sample. By contrast, in
the \textbf{model-based} framework, we treat our data as random
variables and propose a probabilistic model for how the data came to
exist. The models then vary in the strength of their assumptions.
Design-based inference is the framework that addresses the core
inferential questions most crisply, and so it is the focus of this
chapter. Its main disadvantages are that it is considerably less general
than the model-based approach and that the mathematics of the framework
are slightly more complicated.
We will now go over each of the core questions in more detail.
\hypertarget{question-1-population}{%
\section{Question 1: Population}\label{question-1-population}}
Inference is the task of using the data that we have to learn facts
about the world (i.e., the data we do not have). The most
straightforward setting is when we have a fixed set of units that we
want to learn something about. These units are what we call the
\textbf{population} or \textbf{target population}. We are going to focus
on random sampling from this population, but, to do so, we need to have
a list of units from the population. This list of \(N\) units is called
the \textbf{frame} or \textbf{sampling frame}, and we will index these
units in the sampling frame by \(i \in \mathcal{U} = \{1, \ldots, N\}\).
Here we assume that \(N\), the size of the population, is known, but
note that this may not always be true.
The sampling frame may differ from the target population simply for
feasibility reasons. For example, the target population might include
all the households in a given city, but the sampling frame might be the
list of all residential telephone numbers for that city. Of course, many
households do not have landline telephones and rely on mobile phones
exclusively. This gap between the target population and the sampling
frame is called \textbf{undercoverage} or \textbf{coverage bias}.
\begin{example}[]\protect\hypertarget{exm-frame-bias}{}\label{exm-frame-bias}
An early but prominent example of frame bias in survey sampling is the
infamous \emph{Literary Digest} poll of the 1936 U.S. presidential
election. \emph{Literary Digest}, a (now defunct) magazine, sent over 10
million ballots to addresses found in automobile registration lists and
telephone books, trying to figure out who would win the important 1936
presidential race. The sample size was huge: over 2 million respondents.
In the end, the results predicted that Alf Landon, the Republican
candidate, would receive 55\% of the vote, while the incumbent,
Democratic President Franklin D. Roosevelt, would only win 41\% of the
vote. Unfortunately for the \emph{Literary Digest}, Landon received only
37\% of the vote.
There are many possible reasons for this massive polling error. Most
obviously, the sampling frame differed from the target
population. Why? Only those with either a car or a telephone were
included in the sampling frame, and people without either overwhelmingly
supported the Democrat, Roosevelt. While this is not the only source of
bias -- differential nonresponse seems to be a particularly big problem
-- the frame bias contributes a large part of the error. For more about
this poll, see Squire (1988).
\end{example}
One advantage of design-based inference is how precisely we must
articulate the sampling frame. We can be extremely clear about the group
of units we are trying to learn about. We shall see that in model-based
inference, the concepts of the population and sampling frame become more
amorphous.
\begin{example}[American National Election Studies,
Population]\protect\hypertarget{exm-anes-population}{}\label{exm-anes-population}
According to the materials from the American National Election Studies
(ANES) in 2012, its target population is all U.S. citizens age 18 or
older. The sampling frame for the face-to-face portion of the survey
``consisted of the Delivery Sequence File (DSF) used by the United
States Postal Service'' for residential delivery of mail.
Unfortunately, some housing units are not covered by mail delivery
by the postal service, which creates the potential for frame
bias. The designers of the ANES used the Decennial Census to add many of
these units to the final sampling frame.
\end{example}
\hypertarget{question-2-sampling-design}{%
\section{Question 2: Sampling design}\label{question-2-sampling-design}}
Now that we have a clearly defined population and sampling frame, we can
consider how to select a sample from the population. We will focus on
\textbf{probabilistic samples}, where units are selected into the sample
by chance, and each unit in the sampling frame has a nonzero probability
of being included. Let \(\mathcal{S} \subset \mathcal{U}\) be a sample
and let \(\mb{Z} = (Z_1, Z_2, \ldots, Z_N)\) be a vector of inclusion
indicators such that \(Z_i = 1\) if \(i \in \mathcal{S}\) and
\(Z_i = 0\) otherwise. We denote these indicators as upper-case letters
because they are random variables. We assume the sample size is
\(|\mathcal{S}| = n\).
Suppose our sampling frame was the hobbits who are members of the
Fellowship of the Ring, an exclusive group brought into being by a
wizened elf lord. This group of four hobbits is a valid -- albeit small
and fictional -- population with \(\mathcal{U} =\) \{Frodo, Sam, Pip,
Merry\}.
Suppose we want to sample two hobbits from this group. We can list all
six possible samples of size two from this population in terms of the
sample members \(\mathcal{S}\) or, equivalently, the inclusion
indicators \(\mb{Z}\):
\begin{itemize}
\tightlist
\item
\(\mathcal{S}_1 =\) \{Frodo, Sam\} with \(\mb{Z}_{1} = (1, 1, 0, 0)\)
\item
\(\mathcal{S}_2 =\) \{Frodo, Pip\} with \(\mb{Z}_{2} = (1, 0, 1, 0)\)
\item
\(\mathcal{S}_3 =\) \{Frodo, Merry\} with
\(\mb{Z}_{3} = (1, 0, 0, 1)\)
\item
\(\mathcal{S}_4 =\) \{Sam, Pip\} with \(\mb{Z}_{4} = (0, 1, 1, 0)\)
\item
\(\mathcal{S}_5 =\) \{Sam, Merry\} with \(\mb{Z}_{5} = (0, 1, 0, 1)\)
\item
\(\mathcal{S}_6 =\) \{Pip, Merry\} with \(\mb{Z}_{6} = (0, 0, 1, 1)\)
\end{itemize}
A \textbf{sampling design} is a complete specification of how likely
each of these samples is to be selected. That is, we need to determine a
selection probability \(\pi_j\) for each sample \(\mathcal{S}_j\). The
most widely used and widely studied design is one that places equal
probability on each of the possible samples of size \(n\).
\begin{definition}[]\protect\hypertarget{def-srs}{}\label{def-srs}
A \textbf{simple random sample} (srs) is a probability sampling design
where each possible sample of size \(n\) has the same probability of
occurring. More specifically, let \(\mb{z} = (z_{1}, \ldots, z_{N})\) be
a particular possible sample. Then \[
\P(\mb{Z} = \mb{z}) = \begin{cases}
{N \choose n}^{-1} &\text{if } \sum_{i=1}^N z_i = n,\\
0 & \text{otherwise.}
\end{cases}
\]
\end{definition}
If we sampled two hobbits, the srs (the simple random sample) would
place probability \(1/{4\choose 2} = 1/6\) on each of the above samples
\(\mathcal{S}_j\). Note that the srs gives zero probability to any
sample that does not have exactly \(n\) units in the sample.
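
As a quick check on this design, here is a sketch of the unit-level
inclusion probabilities it implies. This calculation is added for
illustration and uses only the list of six samples above. Each hobbit
appears in exactly three of the six equally likely samples, so for every
unit \(i\), \[
\P(Z_i = 1) = \sum_{j \,:\, i \in \mathcal{S}_j} \pi_j = 3 \cdot \frac{1}{6} = \frac{1}{2} = \frac{n}{N}.
\] The same counting argument applies to any \(N\) and \(n\): unit \(i\)
belongs to \({N-1 \choose n-1}\) of the \({N \choose n}\) possible
samples, and \({N-1 \choose n-1} \big/ {N \choose n} = n/N\).
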
Another common sampling design -- the \textbf{Bernoulli sampling} design
-- works by choosing each unit independently with the same probability.
\begin{definition}[]\protect\hypertarget{def-bernoulli}{}\label{def-bernoulli}
\textbf{Bernoulli sampling} is a probability sampling design where
independent Bernoulli trials with probability of success \(q\) determine
whether each unit in the population will be included in the sample. More
specifically, let \(\mb{z} = (z_{1}, \ldots, z_{N})\) be a particular
possible sample. Then, under Bernoulli sampling, \[
\P(\mb{Z} = \mb{z}) = \P(Z_1 = z_1) \cdots \P(Z_N = z_N) = \prod_{i=1}^N q^{z_i}(1 - q)^{1-z_i}.
\]
\end{definition}
Bernoulli sampling is very straightforward because independently
selecting units simplifies many calculations. However, this ``coin
flipping'' approach means that the sample size,
\(N_s = \sum_{i=1}^N Z_i\), is itself a random variable: it equals the
number of coin flips that land on ``heads.''
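
To make the randomness of the sample size concrete, here is a sketch
under the assumption (made here only for illustration) that each of the
\(N = 4\) hobbits is included with probability \(q = 1/2\). Because
\(N_s\) is a sum of \(N\) independent Bernoulli trials with common
success probability \(q\), it follows a binomial distribution, \[
\P(N_s = k) = {N \choose k} q^{k} (1 - q)^{N - k}, \qquad \E[N_s] = Nq.
\] With \(N = 4\) and \(q = 1/2\), the expected sample size is 2, but \[
\P(N_s = 2) = {4 \choose 2} \left(\frac{1}{2}\right)^{4} = \frac{3}{8},
\qquad
\P(N_s = 0) = \left(\frac{1}{2}\right)^{4} = \frac{1}{16},
\] so the sample contains exactly two hobbits only \(3/8\) of the time
and is empty with probability \(1/16\).
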
Simple random samples and Bernoulli random samples are simple to
understand and implement. For large surveys, the sampling designs are
often much more complicated for cost-saving reasons. We now describe the
sampling design for the ANES, which contains many design features
typical of similar large surveys.
\begin{example}[American National Election Studies, Sampling
Design]\protect\hypertarget{exm-anes-design}{}\label{exm-anes-design}
The ANES used a typical yet complicated design for its 2012 face-to-face
survey. First, the designers divided (or stratified) U.S. states into
nine Census divisions (which are based on geography). Within each
division, the designers then randomly sampled a number of census tracts
(with more tracts sampled in divisions with larger populations). Census
tracts with larger populations were selected with higher probability.
The second stage randomly sampled addresses from the sampling frame
(described in Example~\ref{exm-anes-population}). More households were
sampled from tracts with a higher proportion of Black and Latino
residents to obtain an oversample of these groups.
Finally, the third stage of sampling was to randomly select one eligible
person per household for completion of the survey.
\end{example}
\hypertarget{question-3-quantity-of-interest}{%
\section{Question 3: Quantity of
Interest}\label{question-3-quantity-of-interest}}
The \textbf{quantity of interest} is a numerical summary of the
population that we want to learn about. These quantities are also called
\textbf{estimands} (Latin for ``the thing to be estimated'').
Let \(x_1, x_2, \ldots, x_N\) be a fixed set of characteristics, or
items, about the population. Using the statistician's favorite home
decor, we might think about our population as a set of marbles in a jar
where the \(x_i\) values indicate, for example, the color of the
\(i\)-th marble. In a survey, \(x_i\) might represent the age, ideology,
or income of the \(i\)-th person in the population.
We can define many useful quantities of interest based on the population
characteristics. These quantities generally summarize the values
\(x_1, \ldots, x_N\). One of the most common, and certainly one of the
most useful, is the \textbf{population mean}, defined as \[
\overline{x} = \frac{1}{N} \sum_{i=1}^N x_i.
\] The population mean is fixed because \(N\) and the population
characteristics \(x_1, \ldots, x_N\) are fixed. Another common estimand
in the survey sampling literature is the population total, \[
t = \sum_{i=1}^N x_i = N\overline{x}.
\]
\begin{example}[Subpopulation
means]\protect\hypertarget{exm-subpopulation}{}\label{exm-subpopulation}
We may also be interested in quantities for different subdomains.
Suppose we are interested in estimating the fraction of (say)
conservative-identifying respondents who support increasing legal
immigration. Let \(d = 1, \ldots, D\) index the subdomains or
subpopulations. In this case, we might have \(d = 1\) as liberal
identifiers, \(d = 2\) as moderate identifiers, and \(d = 3\) as
conservative identifiers. We will refer to the subpopulation for each of
these groups as \(\mathcal{U}_d \subset \{1,\ldots, N\}\) and we define
the size of these groups as \(N_d = |\mathcal{U}_d|\). So, \(N_3\) would
be the number of conservative-identifying citizens in the population.
The mean for each group is then \[
\overline{x}_d = \frac{1}{N_d} \sum_{i \in \mathcal{U}_d} x_i.
\]
Subpopulation estimation can be slightly more complicated than
population estimation because we may not know who is in which
subpopulation until we actually sample the population. For example, our
sampling frame probably does not contain information about potential
respondents' ideology. Thus, \(N_d\) will be unknown to the researcher,
unlike \(N\) for the population mean, which is known.
\end{example}
We may be interested in many other quantities of interest, but
design-based inference is largely focused on these types of population
and subpopulation means and totals.
\hypertarget{question-4-estimator}{%
\section{Question 4: Estimator}\label{question-4-estimator}}
Now that we have a sampling design and a quantity of interest, we can
consider what we can learn about this quantity of interest from our
sample. An \textbf{estimator} is a function of the sample measurements
intended as a best guess about our quantity of interest.
If the most common estimand is the population mean, the most popular
estimator is the \textbf{sample mean}, defined as \[
\overline{X}_n = \frac{1}{n} \sum_{i=1}^{N}Z_ix_i.
\]
The sample mean is a \textbf{random} quantity since it varies from
sample to sample, and those samples are chosen probabilistically. For
example, suppose we have height measurements from our small population
of hobbits in Table~\ref{tbl-hobbit-pop}.
\hypertarget{tbl-hobbit-pop}{}
\begin{longtable}[]{@{}ll@{}}
\caption{\label{tbl-hobbit-pop}A small population of
hobbits}\tabularnewline
\toprule\noalign{}
Unit (\(i\)) & Height in cm (\(x_i\)) \\
\midrule\noalign{}
\endfirsthead
\toprule\noalign{}
Unit (\(i\)) & Height in cm (\(x_i\)) \\
\midrule\noalign{}
\endhead
\bottomrule\noalign{}
\endlastfoot
1 (Frodo) & 124 \\
2 (Sam) & 127 \\
3 (Pip) & 123 \\
4 (Merry) & 127 \\
\end{longtable}
If we consider a simple random sample of size \(n=2\) from this
population, we can list every possible sample along with its probability
and its sample mean, as we do in
Table~\ref{tbl-hobbit-samples}. Table~\ref{tbl-sampling-dist} combines
the equivalent values of the sample mean to arrive at the
\textbf{sampling distribution} of the sample mean of hobbit height under
an srs of size 2.
\hypertarget{tbl-hobbit-samples}{}
\begin{longtable}[]{@{}
>{\raggedright\arraybackslash}p{(\columnwidth - 4\tabcolsep) * \real{0.2466}}
>{\raggedright\arraybackslash}p{(\columnwidth - 4\tabcolsep) * \real{0.3151}}
>{\raggedright\arraybackslash}p{(\columnwidth - 4\tabcolsep) * \real{0.4384}}@{}}
\caption{\label{tbl-hobbit-samples}All possible simple random samples of
size 2 from the hobbit population}\tabularnewline
\toprule\noalign{}
\begin{minipage}[b]{\linewidth}\raggedright
Sample (\(j\))
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Probability (\(\pi_j\))
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Sample mean (\(\overline{X}_n\))
\end{minipage} \\
\midrule\noalign{}
\endfirsthead
\toprule\noalign{}
\begin{minipage}[b]{\linewidth}\raggedright
Sample (\(j\))
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Probability (\(\pi_j\))
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Sample mean (\(\overline{X}_n\))
\end{minipage} \\
\midrule\noalign{}
\endhead
\bottomrule\noalign{}
\endlastfoot
1 (Frodo, Sam) & 1/6 & (124 + 127) / 2 = 125.5 \\
2 (Frodo, Pip) & 1/6 & (124 + 123) / 2 = 123.5 \\
3 (Frodo, Merry) & 1/6 & (124 + 127) / 2 = 125.5 \\
4 (Sam, Pip) & 1/6 & (127 + 123) / 2 = 125 \\
5 (Sam, Merry) & 1/6 & (127 + 127) / 2 = 127 \\
6 (Pip, Merry) & 1/6 & (123 + 127) / 2 = 125 \\
\end{longtable}
\hypertarget{tbl-sampling-dist}{}
\begin{longtable}[]{@{}ll@{}}
\caption{\label{tbl-sampling-dist}Sampling distribution of the sample
mean for simple random samples of size 2 from the hobbit
population}\tabularnewline
\toprule\noalign{}
Sample mean & Probability \\
\midrule\noalign{}
\endfirsthead
\toprule\noalign{}
Sample mean & Probability \\
\midrule\noalign{}
\endhead
\bottomrule\noalign{}
\endlastfoot
123.5 & 1/6 \\
125 & 1/3 \\
125.5 & 1/3 \\
127 & 1/6 \\
\end{longtable}
Thus, the sampling distribution tells us what values of an estimator are
more or less likely and depends on both the population distribution and
the sampling design.
\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-note-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-note-color-frame]
Notice that the sampling distribution of an estimator will depend on the
sampling design. Here, we used a simple random sample. Bernoulli
sampling would have produced a different distribution. Using Bernoulli
sampling, we could end up with a sample of just Frodo, in which case the
sample mean would be his height (124 cm), a sample mean value that is
impossible with simple random sampling of size \(n=2\).
\end{tcolorbox}
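
To put numbers on this contrast, here is a sketch that assumes a
Bernoulli inclusion probability of \(q = 1/2\). The text does not fix a
value for \(q\), so this choice is purely illustrative. The sample
containing only Frodo corresponds to \(\mb{z} = (1, 0, 0, 0)\), so \[
\P\big(\mb{Z} = (1, 0, 0, 0)\big) = q (1 - q)^{3} = \frac{1}{16}
\] under Bernoulli sampling, whereas the same sample has probability 0
under an srs of size \(n = 2\). In the other direction, each of the six
two-hobbit samples has probability \(q^{2}(1 - q)^{2} = 1/16\) under
this Bernoulli design but \(1/6\) under the srs.
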
\hypertarget{properties-of-the-sampling-distribution-of-an-estimator}{%
\subsection{Properties of the sampling distribution of an
estimator}\label{properties-of-the-sampling-distribution-of-an-estimator}}
Generally speaking, we want ``good'' estimators. But what makes an
estimator ``good''? The best estimator would obviously be the one that
is right all of the time (\(\Xbar_n = \overline{x}\) with probability
1), but this is only possible if we conduct a census -- that is, sample
everyone in the population -- or the population does not vary. Neither
situation is typical for most researchers.
We instead focus on properties of the sampling distribution of an
estimator. The following types of questions get at these properties:
\begin{itemize}
\tightlist
\item
Are the estimator's observed values (realizations) centered on the
true value of the quantity of interest? (unbiasedness)
\item
Is there a lot or a little variation in the realizations of the
  estimator across different samples from the population? (sampling
  variance)
\item
On average, how close to the truth is the estimator? (mean square
error)
\end{itemize}
The answers to these questions will depend on (a) the estimator and (b)
the sampling design.
To back up, the sampling distribution shows us all the possible values
of an estimator across different samples from the population. If we want
to summarize this distribution with a single number, we would focus on
its expectation, which is a measure of central tendency of the
distribution. Roughly speaking, we want the center of the distribution
to be close to and ideally equal to the true quantity of interest. If
this is not the case, that means the estimator systematically over- or
under-estimates the truth. We call this difference the \textbf{bias} of
an estimator, which can be written mathematically as \[
\textsf{bias}[\Xbar_{n}] = \E[\Xbar_{n}] - \overline{x}.
\] Any estimator that has bias equal to zero is called an
\textbf{unbiased} estimator.
We can calculate the bias of the sample mean under our hobbit srs (where
we sampled two hobbits from the Fellowship of the Ring with equal
probability) by first calculating the expected value of the estimator, \[
\E[\Xbar_{n}] = \frac{1}{6}\cdot 123.5 + \frac{1}{3} \cdot 125 + \frac{1}{3} \cdot 125.5 + \frac{1}{6} \cdot 127 = 125.25,
\] and comparing this to the population mean, \[
\overline{x} = \frac{1}{4}\left(124 + 127 + 123 + 127\right) = 125.25.
\] The two are the same, so the bias is zero and the sample mean is
unbiased under this simple random sampling design.
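
This agreement is not specific to the hobbit heights. As a short sketch,
using only the definition of the srs and linearity of expectation: under
an srs, unit \(i\) belongs to \({N-1 \choose n-1}\) of the
\({N \choose n}\) equally likely samples, so
\(\E[Z_i] = \P(Z_i = 1) = n/N\). The \(x_i\) are fixed, so \[
\E[\Xbar_{n}] = \frac{1}{n} \sum_{i=1}^{N} \E[Z_i]\, x_i
= \frac{1}{n} \sum_{i=1}^{N} \frac{n}{N}\, x_i
= \frac{1}{N} \sum_{i=1}^{N} x_i = \overline{x},
\] and therefore \(\textsf{bias}[\Xbar_{n}] = 0\) for any fixed
population values \(x_1, \ldots, x_N\).
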
\begin{tcolorbox}[enhanced jigsaw, opacitybacktitle=0.6, left=2mm, coltitle=black, title=\textcolor{quarto-callout-warning-color}{\faExclamationTriangle}\hspace{0.5em}{Warning}, rightrule=.15mm, toprule=.15mm, colback=white, bottomrule=.15mm, bottomtitle=1mm, colbacktitle=quarto-callout-warning-color!10!white, titlerule=0mm, arc=.35mm, breakable, leftrule=.75mm, toptitle=1mm, opacityback=0, colframe=quarto-callout-warning-color-frame]
Note that the word ``bias'' sometimes also refers to research that is
systematically incorrect in other ways. For example, we might complain
that a survey question is biased if it presents a leading or misleading
question or if it mismeasures the concept of interest. To see this,
suppose we wanted to estimate the proportion of a population that
regularly donates money to a political campaign, but \(x_i\) actually
measures whether a person donated on the day of the survey. In this