Commit

deploy: fec2e56
alexanderthclark committed Mar 19, 2024
1 parent 4dfb72a commit bd9537f
Showing 8 changed files with 133 additions and 26 deletions.
61 changes: 58 additions & 3 deletions _sources/sampling.md
@@ -20,13 +20,36 @@ In practice, finding a good sample statistic is difficult. There are nuances tha

Each hurdle presents an opportunity for bias in the final estimate.

One important type of bias is **selection bias**, which occurs when the sample systematically excludes one kind of person. Suppose a political pollster sets up a table at the local creamery, asking people who they'll vote for in an American election as they shop for organic milk. The sensible population of interest is American voters and the parameters of interest are the percentage voting for each candidate. However, the creamery setting *selects* for a sort of yuppie person who can tolerate lactose. People of Northern European descent have the lowest rate of lactose intolerance, so the sample is unlikely to be representative of America.

The pollster might have a smarter colleague who assembles a list of names that is truly representative of all voters. The next hurdle is getting responses. If a phone survey is used, there will be many who don't answer. This is a problem *if* those who don't respond are systematically different than those who do respond. In fact, old people are usually more available and willing to answer the phone, potentially leading to bias. A systematic difference between respondents and non-respondents is called **non-response bias**.

The specific problem of an unrepresentative distribution of ages in a sample can be solved by using **quota sampling**. In quota sampling, the sample is constructed to resemble the population of interest with respect to key characteristics. This can help, but it doesn't guarantee representativeness because not all important characteristics can be understood.

**Example**: Imagine we want to estimate the public's satisfaction with the US Postal Service. Opinions differ by age and by curmudgeonliness. Perhaps older people like USPS more, but curmudgeons will not like USPS regardless of age *and* curmudgeons will not respond. It's easy to quota sample based on age, and a quota sample will result in an average satisfaction rating of 9 based on {numref}`age-curmudg-table`. This is better than not using quota sampling and constructing a sample that overrepresents older non-curmudgeons. However, the true average rating is 7. Quota sampling can't give a good estimate of the actual population parameter because quotas for old or young people will still systematically exclude curmudgeons.

```{list-table} Average USPS Satisfaction by Age and Curmudgeonliness
:header-rows: 1
:name: age-curmudg-table
* -
- Bottom 50% Age
- Top 50% Age
* - Bottom 50% Curmudgeon
- 8
- 10
* - Top 50% Curmudgeon
- 5
- 5
```
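Under the assumptions of the example (four equally sized groups, and curmudgeons never responding), the quota-sample and true averages can be checked with a short computation. The ratings below come straight from the table; the group labels are just for illustration.

```python
# Satisfaction ratings from the table, one per (age, curmudgeon) group.
# Assumptions from the example: groups are equal in size, and
# curmudgeons never respond to the survey.
ratings = {
    ("young", "non-curmudgeon"): 8,
    ("old", "non-curmudgeon"): 10,
    ("young", "curmudgeon"): 5,
    ("old", "curmudgeon"): 5,
}

# True population average: all four equally sized groups count.
true_avg = sum(ratings.values()) / len(ratings)

# Quota sample balanced on age: half young, half old, but only
# non-curmudgeons respond, so only those two cells are observed.
quota_avg = (ratings[("young", "non-curmudgeon")]
             + ratings[("old", "non-curmudgeon")]) / 2

print(true_avg)   # 7.0
print(quota_avg)  # 9.0
```

The age quota is met exactly, yet the estimate is off by 2 because the quota variable (age) is not the variable driving non-response (curmudgeonliness).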

The last hurdle in surveying is getting truthful responses. People might provide socially acceptable answers instead of truthful answers, especially when the survey involves talking to an actual person. Answers can be influenced by the wording of questions. These phenomena all fall under the umbrella of **response bias**.

The Trafalgar Group is an opinion polling company, notable for predicting that Donald Trump would win the 2016 US presidential election while most polls were wrong. One proposed explanation for why so many polls were wrong is the theory of the "shy Trump voter." A shy Trump voter is someone who voted for Trump but indicated the opposite to pollsters, motivated by fear of social stigma. This is a case of response bias and correcting for it requires some artifice. The Trafalgar group attempted to account for that by asking respondents how they think their neighbors will vote.

### Non-response in the American Time Use Survey

{cite}`abraham2006nonresponse` reports different response rates for the American Time Use Survey, partially reproduced in {numref}`atusresprates`. The differing response rates present a minefield for researchers.

```{list-table} 2004 ATUS Response Rates (Table 2 from [AMB06])
:header-rows: 1
@@ -104,4 +127,36 @@ name: household-survey-respons
Source: [Bureau of Labor Statistics](https://www.bls.gov/osmr/response-rates/)
```

### Probability Methods

A good sample is representative of the population of interest. As in experiments, the best way to ensure representativeness is to use randomization. A **simple random sample** is one drawn at random without replacement. When feasible, this is a good method; feasibility requires that there be some big list of names to draw from. For people who will vote in the next election, no such list exists, so political polling is complicated and its nuts and bolts are beyond our scope.
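When such a list does exist, drawing a simple random sample is straightforward. A minimal sketch with Python's standard library (the sampling frame of voter names is hypothetical):

```python
import random

# A hypothetical sampling frame: the big list of names to draw from.
frame = [f"voter_{i}" for i in range(10_000)]

random.seed(0)  # fixed seed so the draw is reproducible

# random.sample draws without replacement, so no one appears twice.
sample = random.sample(frame, k=100)

print(len(sample))       # 100
print(len(set(sample)))  # 100, confirming no duplicates
```

Every member of the frame has the same chance of selection, which is what makes the method "simple" random sampling.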

Note that probability methods are important regardless of the sample size. An estimate can be decomposed as

$$\text{estimate = parameter + bias + chance error}.$$

The sample size controls the chance error, with larger samples reducing the chance error, but a bad sampling technique is vulnerable to bias no matter the sample size.
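A small simulation illustrates the decomposition: growing the sample shrinks the chance error of a simple random sample, but does nothing for a sample drawn from a biased frame. All numbers here are invented for illustration.

```python
import random

random.seed(1)

# Hypothetical population of 0/1 votes with true support of 50%.
population = [1] * 50_000 + [0] * 50_000

# A biased frame that overrepresents supporters at 60%,
# standing in for a flawed sampling technique.
biased_frame = [1] * 60_000 + [0] * 40_000

for n in (100, 10_000):
    srs = random.sample(population, n)       # simple random sample
    biased = random.sample(biased_frame, n)  # sample from the biased frame
    print(n, sum(srs) / n, sum(biased) / n)

# As n grows, the SRS estimate tightens around the true 0.5,
# while the biased estimate stays near 0.6: bias does not average out.
```

The gap between 0.6 and 0.5 is the bias term in the decomposition; only the spread around those centers is chance error.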


## Exercises

```{exercise-start}
:label: incentivesurvey
```
A user researcher at a large ecommerce company wants to have a representative sample of its users take a survey.

1. They create a survey that is visible to anyone who completes an order on a single day. Describe at least one source of bias.

2. They email 100 users, drawn without replacement, from a list of all people holding an account with the company. What kind of sample is this?

3. In the email to 100 users, they offer a $10 gift card to anyone who completes the survey. What kind of bias is this designed to prevent?

```{exercise-end}
```







6 changes: 3 additions & 3 deletions chancevary.html

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions correlation.html

6 changes: 3 additions & 3 deletions normal.html

Binary file modified objects.inv