-
Notifications
You must be signed in to change notification settings - Fork 18
/
Copy pathpoisson.Rmd
269 lines (210 loc) · 11.9 KB
/
poisson.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
# Poisson Distribution {#poisson}
## Motivating Example {-}
<iframe width="560" height="315" src="https://www.youtube.com/embed/yIPFA1sk5NA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
You are in a room of \(n\) people (including yourself). Each person in the room has contributed $1 to a central pot,
so there is a total of $\(n\) in the pot. The money in the pot will be redistributed back to the people in the room,
in the following way: each dollar is equally likely to go to any one of the \(n\) people, independently of the other
dollars in the pot. This means that some people could get more than $1, while others end up with nothing.
As $n\to\infty$, what is the probability that you end up with no money? There are two common schools of thought:
1. As $n\to\infty$, the number of dollars in the pot increases to infinity, so it seems that
the probability that you end up with at least one of those dollars should approach $1$, i.e., the probability
that you end up with no money approaches 0.
2. As $n\to\infty$, the chance that you earn each dollar, $1/n$, decreases to 0, so it seems that
the probability that you end up with no money approaches 1.
Which school of thought is correct? As it turns out, both arguments are wrong. Let's see why.
We can model the _number of dollars you get_ as the number of $\fbox{1}$s in $n$ draws, with replacement, from the box
\[ \overbrace{\fbox{1}\ \underbrace{\fbox{0}\ \ldots\ \fbox{0}}_{n-1}}^{n}. \]
(Each ticket in the box represents a person in the room. Each draw represents a dollar, and you only get that dollar if your
name is drawn.)
This is exactly the description of a $\text{Binomial}(n, p=\frac{1}{n})$ distribution. So the probability that you end up with
no money is
\begin{align*}
f(0) &= \binom{n}{0} \left( \frac{1}{n} \right)^{0} \left( 1 - \frac{1}{n} \right)^{n} \\
&= \left(1 - \frac{1}{n} \right)^n.
\end{align*}
What is the limit of this probability as $n\to\infty$? Try plugging in some very large numbers for $n$ into the expression
above. You will see that the probability is nowhere near 0 or 1. But what is it approaching?
This is a very famous limit in mathematics. In case you are not familiar with it, take the (natural) logarithm to bring the
$n$ down from the exponent, and apply L'Hôpital's Rule:
\begin{align*}
\lim_{n\to\infty} \log \left(1 - \frac{1}{n} \right)^n &= \lim_{n\to\infty} n \log \left(1 - \frac{1}{n} \right) & \text{(Property of logs: $\log b^a = a \log b$)} \\
&= \lim_{x \to 0} \frac{\log \left(1 - x \right)}{x} & \text{(Let $x = 1/n$.)} \\
&= \lim_{x \to 0} \frac{\frac{d}{dx} \log \left(1 - x \right)}{\frac{d}{dx} x} & \text{(L'Hôpital's Rule on $0/0$ indeterminate form)} \\
&= \lim_{x \to 0} \frac{-\frac{1}{1 - x}}{1} & \text{(Remember your derivatives?)} \\
&= -1.
\end{align*}
Remember that $-1$ is the limit of the logarithm, so to obtain the limit of the original expression, we need to "undo" the
logarithm, i.e., exponentiate:
\begin{equation}
\lim_{n\to\infty} \left(1 - \frac{1}{n} \right)^n = e^{-1}.
(\#eq:poisson-example)
\end{equation}
So the probability that you end up with no money approaches $1/e \approx 0.368$ as $n\to\infty$.
## Theory {-}
The motivating example above illustrates a general phenomenon. A binomial distribution with large $n$ and small $p$ can be
approximated by a p.m.f. involving the constant $e$. The next theorem makes this precise.
```{theorem poisson, name="Poisson Distribution"}
Let $X$ be a $\text{Binomial}(n, p=\frac{\mu}{n})$ random variable, where $\mu$ is a constant. Then, as $n\to\infty$,
the p.m.f. of $X$ approaches
\begin{align}
f(x) &= e^{-\mu}\dfrac{\mu^x}{x!} & x&=0, 1, 2, ....
(\#eq:poisson)
\end{align}
A random variable with \@ref(eq:poisson) as its p.m.f. is said to follow a $\text{Poisson}(\mu)$ distribution.
```
```{example}
In the motivating example, $\mu = 1$, so $f(0) = e^{-1} \frac{1^0}{0!} = e^{-1}$, which matches \@ref(eq:poisson-example).
```
Even when the data does not originate from a binomial model, it is common to assume that count data follow a Poisson
distribution. In the next example, we are explicitly told that the random variable follows a Poisson distribution.
```{example, name="Database Queries"}
The number of typos in a New York Times op-ed when it reaches the copy editor
is a $\text{Poisson}(\mu=4.6)$ random variable.
What is the probability that there are 2 or more typos in a randomly selected op-ed?
```
```{solution}
Let $X$ be the number of typos in the op-ed. We want to know $P(X \geq 2)$. We could calculate this directly, but that
would involve summing the p.m.f. at infinitely many values. Instead, we use the Complement Rule (\@ref(thm:complement)):
\begin{align*}
P(X \geq 2) &= 1 - P(X < 2) \\
&= 1 - f(0) - f(1) \\
&= 1 - e^{-4.6} \frac{4.6^0}{0!} - e^{-4.6} \frac{4.6^1}{1!} \\
&= 1 - e^{-4.6} - 4.6 e^{-4.6} \\
&\approx .944.
\end{align*}
```
Why might the Poisson distribution be a reasonable model for the number of typos?
- Each op-ed has many words (e.g., $n = 1000$).
- There is a small probability that each word has a typo (e.g., $p = .0046$).
- If typos are independent across words, then the number of typos follows a binomial distribution.
- Since $n$ is large and $p$ is small, this binomial distribution can be approximated by a $\text{Poisson}(\mu=np=4.6)$ distribution.
However, all of this is conjecture. We are not told how many words the op-ed has, nor the probability that each
word has a typo. We simply assume that the number of typos follows a Poisson distribution. In practice, the Poisson
model is often used for count data, even when there is no underlying binomial model.
Finally, we include the proof of Theorem \@ref(thm:poisson) for completeness. This proof is not particularly
interesting or insightful, so you do not need to know it.
```{proof, name="Theorem \\@ref(thm:poisson), optional"}
Substitute $p = \frac{\mu}{n}$ into the binomial p.m.f. and watch what happens as $n \to \infty$:
\begin{align*}
f(x) &= \binom{n}{x} (\frac{\mu}{n})^x (1 - \frac{\mu}{n})^{n-x} \\
&= \frac{n!}{x!(n-x)!} \frac{\mu^x}{n^x} (1 - \frac{\mu}{n})^{n-x} \\
&= \frac{n!}{(n-x)! n^x} \cdot \frac{\mu^x}{x!} \cdot (1 - \frac{\mu}{n})^{n-x} \\
&\to 1 \cdot \frac{\mu^x}{x!} \cdot e^{-\mu}
\end{align*}
```
### Visualizing the Distribution {-}
Let's graph the Poisson distribution for different values of $\mu$.
```{r poisson-pmf-1, echo=FALSE, fig.show = "hold", fig.align = "default", fig.asp=0.5}
library(latex2exp)
xs <- 0:20
par(mfrow=c(1, 3))
mu <- 0.9
plot(xs, dpois(xs, mu), pch=19, xlab="x", ylab="f(x)", ylim=c(0, 0.4),
main=TeX(paste("Poisson$(\\mu = ", mu, ")$")))
for(x in xs) {
lines(c(x, x), c(0, dpois(x, mu)), lwd=2)
}
mu <- 5.1
plot(xs, dpois(xs, mu), pch=19, xlab="x", ylab="f(x)", ylim=c(0, 0.4),
main=TeX(paste("Poisson$(\\mu = ", mu, ")$")))
for(x in xs) {
lines(c(x, x), c(0, dpois(x, mu)), lwd=2)
}
mu <- 10.2
plot(xs, dpois(xs, mu), pch=19, xlab="x", ylab="f(x)", ylim=c(0, 0.4),
main=TeX(paste("Poisson$(\\mu = ", mu, ")$")))
for(x in xs) {
lines(c(x, x), c(0, dpois(x, mu)), lwd=2)
}
```
The probability mass shifts to the right as $\mu$ increases. This makes sense if we
interpret the Poisson distribution as an approximation to the binomial, where $\mu=np$.
To increase $\mu$, we must either increase $n$ or $p$, both of which tend to increase
the number of $\fbox{1}$s that we draw.
### Calculating Poisson Probabilities on the Computer {-}
The Poisson distribution is built into many software packages.
```{example, name="Disk Failures at a Data Center"}
A data center has 10,000 disk drives. Suppose that a disk drive fails in a given day with
probability $10^{-3}$.
a. What is the probability that there are 12 or more disk failures tomorrow?
b. How many spare disk drives should be available so that all failures in a day can be replaced with probability 99%?
```
```{solution}
First, we will set up a box model for _the number of disk failures_. We have a box with
- $N_0 = 999$ tickets labeled $\fbox{0}$
- $N_1 = 1$ tickets labeled $\fbox{1}$
to represent the $10^{-3}$ probability of failure. We will draw 10,000 tickets from this box,
with replacement, to represent whether each of the 10,000 disks fails.
This shows that the number of failures, which we will call $X$,
follows a $\text{Binomial}(n=10^4, p=10^{-3})$ distribution.
We can calculate the exact probabilities using this binomial distribution or
approximate probabilities using a $\text{Poisson}(\mu=np=10)$ distribution.
Either way, we will want to calculate the probabilities at a computer. For example,
the quickest way to calculating $P(X \geq 12)$ still requires calculating
\[ 1 - f(0) - f(1) - f(2) - \ldots - f(11), \]
which is not something we want to do by hand.
```
Here's how we would calculate the probability using the Python library [Symbulate](http://dlsun.github.io/symbulate).
```{python}
from symbulate import *
probs = Poisson(10).pmf(
range(12) # range(12) is [0, 1, 2, ..., 11]
)
1 - sum(probs)
```
Alternatively, we could also calculate this using the c.d.f. and the complement rule. This time,
let's compare the approximate answer that we get from the Poisson distribution with the
exact answer that we get from the binomial distribution.
```{python}
1 - Poisson(10).cdf(11), 1 - Binomial(n=10 ** 4, p=10 ** -3).cdf(11)
```
Very close indeed!
The second question can be addressed by trial and error. We want to figure out the value $c$ such that
$F(c) = P(X \leq c) = .99$. From above, we already know that $F(11) \approx 0.70$, so we can start
trying values from $c=12$. The simple script below increases $x$ by $1$ until the
cumulative probability crosses $.99$.
```{python}
x = 12
while Poisson(10).cdf(x) < .99:
x += 1
# print out the value of x and the probability
x, Poisson(10).cdf(x)
```
You can play around with the Python code in [this Colab notebook](https://colab.research.google.com/github/dlsun/probability/blob/master/colabs/py/Calculating_Poisson_Probabilities.ipynb).
It is also possible to do this calculation in R, a statistical programming language.
```{r}
probs <- dpois(0:11, 10)
1 - sum(probs)
```
We can also use the c.d.f. function:
```{r}
1 - ppois(11, 10)
```
The second question can be addressed by trial and error. We want to figure out the value $c$ such that
$F(c) = P(X \leq c) = .99$. From above, we already know that $F(11) \approx 0.70$, so we can start
trying values from $c=12$. The simple script below increases $x$ by $1$ until the c.d.f. crosses $.99$.
```{r}
x <- 12
while(ppois(x, 10) < .99) {
x <- x + 1
}
# print out the value of x and the probability
c(x, ppois(x, 10))
```
You can play around with the R code in [this Colab notebook](https://colab.research.google.com/github/dlsun/probability/blob/master/colabs/r/Calculating_Poisson_Probabilities.ipynb).
## Essential Practice {-}
1. If you buy a lottery ticket in 50 lotteries, in each of which your chances of winning a prize of $1/100$, what
is the probability that you will win a prize:
a. at least once?
b. exactly twice?
c. at least twice?
Calculate both the exact probabilities (using the binomial distribution) and the approximate probabilities
(using the Poisson distribution).
2. The number of organisms in $V$ cubic meters of ballast water discharged from a ship follows a $\text{Poisson}(\mu=10 V)$
distribution. (See "Counting at Low Concentrations: The Statistical Challenges of Verifying Ballast Water Discharge
Standards", _Ecological Applications_, 2013:339–351.)
a. What is the probability that there are at least 12 organisms in 1.5 cubic meters of discharge?
b. For what amount of discharge would the probability of containing at least one organism be .999?
3. Thelma calculates the exact probability of winning more than 50% of the time when she places
1000 bets on red in roulette. Louise calculates an approximate probability using the Poisson distribution.
They get very different answers. What answers did they get, and why did the Poisson approximation fail in this case?