```{r, include = FALSE}
source("global_stuff.R")
```
# Sampling Distributions
"10/2/2020 | Last Compiled: `r Sys.Date()`"
## Readings
@crumpAnsweringQuestionsData2018, [4.8 - 4.10](https://crumplab.github.io/statistics/probability-sampling-and-estimation.html#samples-populations-and-sampling)
## Review
<div class="videoWrapper"> <iframe width="560" height="315" src="https://www.youtube.com/embed/xbQaPWdKUp0" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </div>
In the last two labs we have begun to explore distributions, and have used R to create and sample from distributions. We will continue this exploration in many of the following labs.
As we continue I want to point out a major conceptual goal that I have for you. This is to have a nuanced understanding of one statement and one question:
1. Chance can do things
2. What can chance do?
To put this into concrete terms from previous labs: you know that 'chance' can produce different outcomes when you flip a coin. This is a sense in which chance can do things. A 50% chance process can sometimes make a heads, and sometimes a tails. Also, we have started to ask "what can chance do?" in prior labs. For example, we asked how often chance produces 10 heads in a row when you flip a coin. We found that chance doesn't do that very often, compared to, say, 5 heads and 5 tails.
We are using these next labs to find different ways to use R to experience 1) **that** chance can do things, and 2) how **likely** it is that some things happen by chance. We are working toward a third question (for next lab), which is 3) did chance do it?...or when I run an experiment, is it possible that chance alone produced the data that was collected?
## Overview
This lab has the following modules:
1. Conceptual Review I: Probability Distributions
- we review sampling from probability distributions using R and examine a few additional aspects of base R distribution functions
2. Conceptual Review II: Sampling Distributions
- we use R to create a new kind of distribution, called a sampling distribution. This will prepare you for future statistics lectures and concepts that all fundamentally depend on sampling distributions.
- **Understanding sampling distributions may well be the most fundamental thing to understand about statistics** (but that's just my opinion).
## Probability Distributions
In previous labs we learned that it is possible to sample numbers from particular distributions in R.
```
#see all the distribution functions
?distributions
```
### Normal Distribution
We use `rnorm()` to sample numbers from a normal distribution:
```{r}
rnorm(n=10, mean = 0, sd = 1)
```
We can 'see' the distribution by sampling a large number of observations, and plotting them in a histogram:
```{r}
library(ggplot2)
some_data <- data.frame(observations = rnorm(n = 10000, mean = 0, sd = 1),
                        type = "A")
ggplot(some_data, aes(x = observations)) +
  geom_histogram(bins = 100, color = "black",
                 fill = "orange")
```
We can see in this example that using random chance to sample from this distribution caused these numbers to be observed. So, we can see that "chance did something". We can also see that chance did some things more than others. Values close to 0 were sampled much more often than values larger than 2.5.
How often did we sample a value larger than 2.5? What is the probability that chance would produce a value larger than 2.5? What about any other value for this distribution? To answer these questions, we need to get more specific about exactly what chance is capable of doing in this situation, for this distribution.
1. We can answer a question like the above through observation. We can look at the sample that we generated, and see how many numbers out of the total are larger than a particular value:
```{r}
some_data$observations[some_data$observations > 2.5]
length(some_data$observations[some_data$observations > 2.5])
length(some_data$observations[some_data$observations > 2.5])/10000
```
2. We could also compute the probability directly using analytical formulas, and these formulas also exist in R. Specifically, the names of distribution functions begin with `d`, `p`, `q`, and `r`, so there are `dnorm`, `pnorm`, `qnorm`, and `rnorm` functions for the normal distribution (and likewise for other distributions).
#### rnorm()
`rnorm(n, mean = 0, sd = 1)` samples observations (or random deviates) from a normal distribution with specified mean and standard deviation.
```{r}
rnorm(n=10, mean = 0, sd = 1)
```
#### dnorm()
`dnorm(x, mean = 0, sd = 1, log = FALSE)` is the probability density function. It returns the probability density of the distribution for any value that can be obtained from the distribution.
For example, in the above histogram, we can see that the distribution produces values roughly between -3 and 3, or perhaps -4 to 4. We also see that as values approach 0, they happen more often. So, the probability density changes across the distribution. We can plot this directly using `dnorm()`, by supplying a sequence of values, say from -4 to 4.
```{r}
library(ggplot2)
some_data <- data.frame(density = dnorm(-4:4, mean = 0, sd = 1),
                        x = -4:4)
knitr::kable(some_data)
ggplot(some_data, aes(x = x, y = density)) +
  geom_point()
```
To generate a plot of the full distribution in R, you could calculate additional intervening values on the x-axis, and use a line plot.
```{r}
some_data <- data.frame(density = dnorm(seq(-4, 4, .001), mean = 0, sd = 1),
                        x = seq(-4, 4, .001))
ggplot(some_data, aes(x = x, y = density)) +
  geom_line()
```
Note that this probability density function can be consulted to answer a question like: what is the probability of getting a value larger than 2.5? The answer is given by the area under the curve (in red).
```{r}
library(dplyr)
some_data <- data.frame(density = dnorm(seq(-4, 4, .001), mean = 0, sd = 1),
                        x = seq(-4, 4, .001))
region_data <- some_data %>%
  filter(x > 2.5)
ggplot(some_data, aes(x = x, y = density)) +
  geom_line() +
  geom_ribbon(data = region_data,
              fill = "red",
              aes(ymin = 0, ymax = density))
```
#### pnorm()
`pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)` takes a given value that the distribution could produce as an input called `q` for quantile. The function then returns the proportional area under the curve up to that value.
For example, what is the area under the curve starting from the left (e.g., negative infinity) all the way to 2.5?
```{r}
pnorm(2.5, mean=0, sd = 1)
```
This is the "complement" of our question. That is, 99.37903% of all values drawn from the distribution are expected to be **less than 2.5**. We have just determined the probability of getting a number smaller than 2.5, otherwise known as the lower tail.
```{r}
some_data <- data.frame(density = dnorm(seq(-4, 4, .001), mean = 0, sd = 1),
                        x = seq(-4, 4, .001))
region_data <- some_data %>%
  filter(x < 2.5)
ggplot(some_data, aes(x = x, y = density)) +
  geom_line() +
  geom_ribbon(data = region_data,
              fill = "red",
              aes(ymin = 0, ymax = density))
```
By default, `pnorm` calculates the lower tail, or the area under the curve from the `q` value point to the left side of the plot.
To calculate the probability of getting a number larger than a particular value, you can take the complement, or set `lower.tail = FALSE`:
```{r}
1 - pnorm(2.5, mean=0, sd = 1)
pnorm(2.5, mean=0, sd = 1, lower.tail=FALSE)
```
#### qnorm()
`qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)` is similar to `pnorm`, but it takes a probability as an input, and returns a specific point value (or quantile) on the x-axis. More formally, you specify a proportional area under the curve starting from the left, and the function tells you which number that area corresponds to.
For the remaining examples we assume mean = 0 and sd = 1.
What number on the x-axis is the location where 25% of the values are smaller than that value?
```{r}
qnorm(.25, mean= 0, sd =1)
```
What number on the x-axis is the location where 50% of the values are smaller than that value?
```{r}
qnorm(.5, mean= 0, sd =1)
```
What is the number where 95% of the values are larger than that number?
```{r}
# the value with 95% of observations above it
qnorm(.95, mean = 0, sd = 1, lower.tail = FALSE)
# the value with 5% of observations above it (the 95th percentile)
qnorm(.05, mean = 0, sd = 1, lower.tail = FALSE)
```
### Summary
Focusing on the normal distribution functions, we have learned that chance can produce different kinds of numbers from a normal distribution. And, we can use the `dnorm`, `qnorm`, and `pnorm` functions to exactly compute the specific probabilities that **certain ranges of values** occur.
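For example, to compute the probability that a value falls within a particular range, we can subtract one `pnorm()` result from another (a quick sketch using only base R; the -1 to 1 range is just an illustrative choice):

```{r}
# probability that a standard normal value falls between -1 and 1
# (area under the curve from -1 to 1)
pnorm(1, mean = 0, sd = 1) - pnorm(-1, mean = 0, sd = 1)
```

This returns roughly 0.68, the familiar rule of thumb that about 68% of a normal distribution lies within one standard deviation of the mean.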
## Conceptual: Sampling Distributions
When we collect data we assume they come from some "distribution", which "causes" some numbers to occur more than others. The data we collect is a "sample" or portion of the "distribution".
We know that when we sample from distributions chance can play a role. Specifically, by chance alone one sample of observations could look different from another sample, even if they came from the same distribution. In other words, **we recognize that the process of sampling from a distribution involves variability or uncertainty**.
We can use **sampling distributions** as a tool to help us understand and predict how our sampling process will behave. This way we can have information about how variable or uncertain our samples are.
### Confusing jargon
Throughout this course you will come across terms like "sampling distributions", "the sampling distribution of the sample mean", "the sampling distribution of any sample statistic", and "the standard deviation of the sampling distribution of the sample mean is the standard error of the mean". Although all of these sentences are hard to parse and, in my opinion, very jargony, they all represent very important ideas that need to be distinguished and well understood. We are going to work on these things in this lab.
### The sample mean
For all of the remaining examples we will use a normal distribution with mean = 0 and sd =1.
We already know what the sample mean is and how to calculate it in R. Here is an example of calculating a sample mean, where the number of observations (n) in the sample is 10.
```{r}
mean(rnorm(10, mean=0, sd =1))
```
### Multiple sample means
We can repeat the above process as many times as we like, each time creating a sample of 10 observations and computing the mean.
Here is an example of creating 5 sample means from 5 sets of 10 observations.
```{r}
mean(rnorm(10, mean=0, sd =1))
mean(rnorm(10, mean=0, sd =1))
mean(rnorm(10, mean=0, sd =1))
mean(rnorm(10, mean=0, sd =1))
mean(rnorm(10, mean=0, sd =1))
```
Notice that each of the sample means is different; this is because of the variability introduced by randomly choosing values from the same normal distribution.
The mean of the distribution that the samples come from is 0, so what do we expect the mean of the samples to be? In general, we expect 0, but we can see that not all of the sample means are exactly 0.
How much variability can we expect for our sample mean? In other words, if we are going to obtain a sample of 10 numbers from this distribution, **what kinds of sample means could we get?**
### The sampling distribution of the sample means
The answer to the question **what kinds of sample means could we get?** is "the sampling distribution of the sample means". In other words, if you do the work to actually find out and create a bunch of samples, and then find their means, then you have a bunch of sample means, and all of these numbers form a distribution. This distribution is effectively showing you all of the different ways that random chance can produce particular sample means.
Let's make a distribution of sample means. We will create 10,000 samples, each with 10 observations, and compute the mean for each. We will save and look at all of the means in a histogram.
```{r}
sample_means <- replicate(10000, mean(rnorm(10,0,1)))
hist(sample_means)
```
The above is a histogram representing the means for 10,000 samples. We can refer to this as a sampling distribution of the sample means. It is how sample means (in this one particular situation) are generally distributed.
If you wanted to know what to expect from a single sample mean (if you knew you were taking values from this normal distribution), then you could look at this sampling distribution.
Sample means close to 0 happen the most. So, most of the time, when you take a sample from this distribution, the mean of that sample will be close to 0. It is rare for a sample mean to be larger than .5. It is very very rare for a sample mean to be larger than 1.
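We can make those "rare" claims concrete by counting, rather than eyeballing the histogram (a quick check; the exact proportions will vary a little from run to run since the samples are random):

```{r}
# regenerate a sampling distribution of 10,000 sample means (n = 10 each)
sample_means <- replicate(10000, mean(rnorm(10, 0, 1)))
# proportion of sample means beyond each cutoff
mean(sample_means > .5)
mean(sample_means > 1)
```

The first proportion should be small (a few percent), and the second should be close to zero, matching the visual impression from the histogram.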
### The standard error of the mean
We have not discussed the concept or definition of the standard error of the mean in the lecture portion of this class. However, we have seen that the mean for a sample taken from a distribution has expected variability; specifically, the fact that there is a distribution of different sample means shows that there is variability.
What descriptive statistic have we already discussed that provides measures of variability? One option is the standard deviation. For example, we could do the following to measure the variability associated with a distribution of sample means.
1. Generate a distribution of sample means
2. Calculate the standard deviation of the sample means
The standard deviation of the sample means would give us an idea of how much variability we expect from our sample mean. We could do this quickly in R like this:
```{r}
sample_means <- replicate(10000, mean(rnorm(10,0,1)))
sd(sample_means)
```
The value we calculated describes the amount of error we expect, in general, from a sample mean. Specifically, if the true population mean is 0, then when we obtain samples, we expect the sample means will have some error: they should on average be 0, but plus or minus the standard deviation we calculated.
We do not necessarily have to generate a distribution of sample means to calculate the standard error. If we know the population standard deviation ($\sigma$), then the formula for the standard error of the mean (SEM) is:
$\text{SEM} = \frac{\sigma}{\sqrt{N}}$
where $\sigma$ is the population standard deviation, and $N$ is the sample size.
We can also compare the SEM from the formula to the one we obtained by simulation, and we find they are similar.
```{r}
# simulation SEM
sample_means <- replicate(10000, mean(rnorm(10,0,1)))
sd(sample_means)
# analytic SEM
1/sqrt(10)
```
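As a further check (my own extension, not part of the lab so far), the formula predicts that quadrupling the sample size should cut the SEM in half, because of the $\sqrt{N}$ in the denominator. We can verify this by simulation:

```{r}
# SEM shrinks with the square root of the sample size:
# n = 10 should give roughly 1/sqrt(10) = .316
sd(replicate(10000, mean(rnorm(10, 0, 1))))
# n = 40 should give roughly 1/sqrt(40) = .158, half as large
sd(replicate(10000, mean(rnorm(40, 0, 1))))
```

This is one practical payoff of the formula: it tells you how much extra precision you buy with a larger sample, without having to run the simulation.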
## Lab 5 Generalization Assignment
<div class="videoWrapper"> <iframe width="560" height="315" src="https://www.youtube.com/embed/rjp7kmIo2Qk" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </div>
### Instructions
In general, labs will present a discussion of problems and issues with example code like above, and then students will be tasked with completing generalization assignments, showing that they can work with the concepts and tools independently.
Your assignment instructions are the following:
1. Work inside the R project "StatsLab1" you have been using
2. Create a new R Markdown document called "Lab5.Rmd"
3. Use Lab5.Rmd to show your work attempting to solve the following generalization problems. Commit your work regularly so that it appears on your Github repository.
4. **For each problem, make a note about how much of the problem you believe you can solve independently without help**. For example, if you needed to watch the help video and are unable to solve the problem on your own without copying the answers, then your note would be 0. If you are confident you can complete the problem from scratch completely on your own, your note would be 100. It is OK to have all 0s, all 100s, or anything in between.
5. Submit your github repository link for Lab 5 on blackboard.
6. There are five problems to solve.
### Problems
1. Trust but verify. We trust that `rnorm()` will generate random deviates in accordance with the definition of the normal distribution. For example, we learned in this lab that a normal distribution with mean = 0 and sd = 1 should only produce values larger than 2.5 with a specific small probability, that is, P(x > 2.5) = 0.006209665. Verify this is approximately the case by randomly sampling 1 million numbers from this distribution, and calculate what proportion of numbers are larger than 2.5. (1 point)
2. If performance on a standardized test was known to follow a normal distribution with mean 100 and standard deviation 10, and 10,000 people took the test, how many people would be expected to achieve a score higher than 3 standard deviations from the mean? (1 point)
3. You randomly sample 25 numbers from a normal distribution with mean = 10 and standard deviation = 20. You obtain a sample mean of 12. You want to know the probability that you could have received a sample mean of 12 or larger.
Create a sampling distribution of the mean for this scenario with at least 10,000 sample means (1 point). Then, calculate the proportion of sample means that are 12 or larger (1 point).
4. You randomly sample **100** numbers from a normal distribution with mean = 10 and standard deviation = 20. You obtain a sample mean of 12. You want to know the probability that you could have received a sample mean of 12 or larger.
Create a sampling distribution of the mean for this scenario with at least 10,000 sample means. Then, calculate the proportion of sample means that are 12 or larger. Is the proportion different from question 3, why? (1 point).
5. You randomly sample 25 numbers from a normal distribution with mean = 10 and standard deviation = 20. You obtain a sample standard deviation of 15. You want to know the probability that you could have received a sample standard deviation of 15 or less.
Create a sampling distribution of standard deviations for this scenario with at least 10,000 sample standard deviations. Then, calculate the proportion of sample standard deviations that are 15 or less. (1 point)