06-probability.Rmd

# Probability

## Load packages, load data, set theme

Let's load the packages that we need for this chapter. 

```{r, message=FALSE}
library("knitr")        # for rendering the RMarkdown file
library("kableExtra")   # for nicely formatted tables
library("arrangements") # fast generators and iterators for creating combinations
library("DiagrammeR")   # for drawing diagrams
library("tidyverse")    # for data wrangling 
```

Set the plotting theme.

```{r}
theme_set(theme_classic() + 
            theme(text = element_text(size = 20)))

opts_chunk$set(comment = "",
               fig.show = "hold")
```


## Counting

Imagine that there are three balls in an urn. The balls are labeled 1, 2, and 3. Let's consider a few possible situations. 

```{r}
balls = 1:3 # number of balls in urn 
ndraws = 2 # number of draws

# order matters, without replacement
permutations(balls, ndraws)

# order matters, with replacement
permutations(balls, ndraws, replace = T)

# order doesn't matter, with replacement 
combinations(balls, ndraws, replace = T)

# order doesn't matter, without replacement 
combinations(balls, ndraws)
```

I've generated the figures below using the `DiagrammeR` package. It's a powerful package for drawing diagrams in R. See information on how to use the DiagrammeR package [here](https://rich-iannone.github.io/DiagrammeR/). 

```{r, echo=FALSE, fig.cap="Drawing two marbles out of an urn __with__ replacement."}
grViz("
digraph dot{
  
  # general settings for all nodes
  node [
    shape = circle,
    style = filled,
    color = black,
    label = ''
    fontname = 'Helvetica',
    fontsize = 24,
    fillcolor = lightblue
    ]
  
  # edges between nodes
  edge [color = black]
  0 -> {1 2 3}
  1 -> {11 12 13}
  2 -> {21 22 23}
  3 -> {31 32 33}
  
  # labels for each node
  0 [fillcolor = 'black', width = 0.1]
  1 [label = '1']
  2 [label = '2']
  3 [label = '3']
  11 [label = '1']
  12 [label = '2']
  13 [label = '3']
  21 [label = '1']
  22 [label = '2']
  23 [label = '3']
  31 [label = '1']
  32 [label = '2']
  33 [label = '3']
    
  # direction in which arrows are drawn (from left to right)
  rankdir = LR
}
")

```

```{r, echo=FALSE, fig.cap="Drawing two marbles out of an urn __without__ replacement."}
grViz("
digraph dot{
  
  # general settings for all nodes
  node [
    shape = circle,
    style = filled,
    color = black,
    label = ''
    fontname = 'Helvetica',
    fontsize = 24,
    fillcolor = lightblue
    ]
  
  # edges between nodes
  edge [color = black]
  0 -> {1 2 3}
  1 -> {12 13}
  2 -> {21 23}
  3 -> {31 32}
  
  # labels for each node
  0 [fillcolor = 'black', width = 0.1]
  1 [label = '1']
  2 [label = '2']
  3 [label = '3']
  12 [label = '2']
  13 [label = '3']
  21 [label = '1']
  23 [label = '3']
  31 [label = '1']
  32 [label = '2']
  
  # direction in which arrows are drawn (from left to right)
  rankdir = LR
}
")
```

## The random secretary

A secretary types four letters to four people and addresses the four envelopes. If he inserts the letters at random, each in a different envelope, what is the probability that exactly three letters will go into the right envelope?

```{r}
df.letters = permutations(x = 1:4, k = 4) %>% 
  as_tibble(.name_repair = ~ str_c("person_", 1:4)) %>%
  mutate(n_correct = (person_1 == 1) + 
           (person_2 == 2) + 
           (person_3 == 3) +
           (person_4 == 4))

df.letters %>% 
  summarize(prob_3_correct = sum(n_correct == 3) / n())
```

```{r}
ggplot(data = df.letters,
       mapping = aes(x = n_correct)) + 
  geom_bar(aes(y = stat(count)/sum(count)),
           color = "black",
           fill = "lightblue") +
  scale_y_continuous(labels = scales::percent,
                     expand = c(0, 0)) + 
  labs(x = "number correct",
       y = "probability")
```

## Flipping a coin many times

```{r, fig.cap='A demonstration of the law of large numbers.'}

# Example taken from here: http://statsthinking21.org/probability.html#empirical-frequency

set.seed(1) # set the seed so that the outcome is consistent
nsamples = 50000 # how many flips do we want to make?

# create some random coin flips using the rbinom() function with
# a true probability of 0.5

df.samples = tibble(trial_number = seq(nsamples), 
                    outcomes = rbinom(nsamples, 1, 0.5)) %>% 
  mutate(mean_probability = cumsum(outcomes) / seq_along(outcomes)) %>% 
  filter(trial_number >= 10) # start with a minimum sample of 10 flips

ggplot(data = df.samples, 
       mapping = aes(x = trial_number, y = mean_probability)) +
  geom_hline(yintercept = 0.5, color = "gray", linetype = "dashed") +
  geom_line() +
  labs(x = "Number of trials",
       y = "Estimated probability of heads") +
  theme_classic() +
  theme(text = element_text(size = 20))
```

## Clue guide to probability

```{r}
who = c("ms_scarlet", "col_mustard", "mrs_white",
        "mr_green", "mrs_peacock", "prof_plum")
what = c("candlestick", "knife", "lead_pipe",
         "revolver", "rope", "wrench")
where = c("study", "kitchen", "conservatory",
          "lounge", "billiard_room", "hall",
          "dining_room", "ballroom", "library")

df.clue = expand_grid(who = who,
                      what = what,
                      where = where)

df.suspects = df.clue %>% 
  distinct(who) %>% 
  mutate(gender = ifelse(test = who %in% c("ms_scarlet", "mrs_white", "mrs_peacock"), 
                         yes = "female", 
                         no = "male"))
```

```{r}
df.suspects %>% 
  arrange(desc(gender)) %>% 
  kable() %>% 
  kable_styling("striped", full_width = F)
```

### Conditional probability

```{r}
# conditional probability (via rules of probability)
df.suspects %>% 
  summarize(p_prof_plum_given_male = 
              sum(gender == "male" & who == "prof_plum") /
              sum(gender == "male"))
```
```{r}
# conditional probability (via rejection)
df.suspects %>% 
  filter(gender == "male") %>% 
  summarize(p_prof_plum_given_male = 
              sum(who == "prof_plum") /
              n())
```

### Law of total probability

```{r, echo=FALSE}
grViz("
digraph dot{
  
  # general settings for all nodes
  node [
    shape = circle,
    style = filled,
    color = black,
    label = ''
    fontname = 'Helvetica',
    fontsize = 9,
    fillcolor = lightblue,
    fixedsize=true,
    width = 0.8
    ]
  
  # edges between nodes
  edge [color = black,
        fontname = 'Helvetica',
        fontsize = 10]
  1 -> 2 [label = 'p(female)']
  1 -> 3 [label = 'p(male)']
  2 -> 4 [label = 'p(revolver | female)'] 
  3 -> 4 [label = 'p(revolver | male)']
  
  
  # labels for each node
  1 [label = 'Gender?']
  2 [label = 'If female\nuse revolver?']
  3 [label = 'If male\nuse revolver?']
  4 [label = 'Revolver\nused?']
  
  rankdir='LR'
  }"
)
```

## Probability operations

```{r}
# Make a deck of cards 
df.cards = tibble(suit = rep(c("Clubs", "Spades", "Hearts", "Diamonds"), each = 8),
                  value = rep(c("7", "8", "9", "10", "Jack", "Queen", "King", "Ace"), 4)) 
```

```{r}
# conditional probability: p(Hearts | Queen) (via rules of probability)
df.cards %>% 
  summarize(p_hearts_given_queen = 
              sum(suit == "Hearts" & value == "Queen") / 
              sum(value == "Queen"))
```

```{r}
# conditional probability: p(Hearts | Queen) (via rejection)
df.cards %>% 
  filter(value == "Queen") %>%
  summarize(p_hearts_given_queen = sum(suit == "Hearts")/n())
```

## Bayesian reasoning explained

```{r, echo=FALSE}
grViz("
digraph dot{
  
  # general settings for all nodes
  node [
    shape = circle,
    style = filled,
    color = black,
    label = ''
    fontname = 'Helvetica',
    fontsize = 10,
    fillcolor = lightblue,
    fixedsize=true,
    width = 0.8
    ]
  
  # edges between nodes
  edge [color = black,
        fontname = 'Helvetica',
        fontsize = 10]
  1 -> 2 [label = 'ill']
  1 -> 3 [label = 'healthy']
  2 -> 4 [label = 'test +'] 
  2 -> 5 [label = 'test -']
  3 -> 6 [label = 'test +']
  3 -> 7 [label = 'test -']
  

  # labels for each node
  1 [label = '10000\npeople']
  2 [label = '100']
  3 [label = '9900']
  4 [label = '95']
  5 [label = '5']
  6 [label = '495']
  7 [label = '9405']
  
  rankdir='LR'
  }"
)
```

## Getting Bayes right matters

### Bayesian reasoning example

```{r}
# prior probability of the disease
p.D = 0.0001

# sensitivity of the test 
p.T_given_D = 0.999

# specificity of the test 
p.notT_given_notD = 0.999
p.T_given_notD = (1 - p.notT_given_notD)

# posterior given a positive test result 
p.D_given_T = (p.T_given_D * p.D) / ((p.T_given_D * p.D) + (p.T_given_notD * (1-p.D)))

p.D_given_T
```

### Bayesian reasoning example (COVID rapid test)

https://pubmed.ncbi.nlm.nih.gov/34242764/#:~:text=The%20overall%20sensitivity%20of%20the,%25%20CI%2024.4%2D65.1).

```{r}
# prior probability of the disease
p.D = 0.1 

# sensitivity covid rapid test
p.T_given_D = 0.653

# specificity of covid rapid test
p.notT_given_notD = 0.999

p.T_given_notD = (1 - p.notT_given_notD)

# posterior given a positive test result 
p.D_given_T = (p.T_given_D * p.D) / ((p.T_given_D * p.D) + (p.T_given_notD * (1-p.D)))

# posterior given a negative test result 
p.D_given_notT = ((1-p.T_given_D) * p.D) / (((1-p.T_given_D) * p.D) + ((1-p.T_given_notD) * (1-p.D)))

str_c("Probability of COVID given a positive test: ", round(p.D_given_T * 100, 1), "%")
str_c("Probability of COVID given a negative test: ", round(p.D_given_notT * 100, 1), "%")
```

### Most people in the hospital are vaccinated

```{r}
# probability of being vaccinated 
p.V = 0.8 

# likelihood of hospital 
p.H_given_V = 0.2
p.H_given_notV = 0.5

# posterior probability 
p.V_given_H = (p.H_given_V * p.V) / ((p.H_given_V * p.V) + (p.H_given_notV * (1-p.V)))

p.V_given_H
```

## Building a Bayesis

### Dice example

```{r}
# prior
p.four = 0.5
p.six = 0.5

# possibilities 
df.possibilities = tibble(observation = 1:6,
                          p.four = c(rep(1/4, 4), rep(0, 2)),
                          p.six = c(rep(1/6, 6)))

# data
# data = c(4)
# data = c(4, 2, 1)
data = c(4, 2, 1, 3, 1)
# data = c(4, 2, 1, 3, 1, 5)

# likelihood
p.data_given_four = prod(df.possibilities$p.four[data])
p.data_given_six = prod(df.possibilities$p.six[data])

# posterior
p.four_given_data = (p.data_given_four * p.four) /  
  ((p.data_given_four * p.four) + 
     (p.data_given_six * p.six))

p.four_given_data
```

Given this data $d$ = [`r data`], there is a `r round(p.four_given_data * 100)`% chance that the four sided die was rolled rather than the six sided die. 

## Additional resources

### Cheatsheets

- [Probability cheatsheet](figures/probability.pdf)

### Books and chapters

- [Probability and Statistics with examples using R](http://www.isibang.ac.in/~athreya/psweur/)
- [Learning statistics with R: Chapter 9 Introduction to probability](https://learningstatisticswithr-bookdown.netlify.com/probability.html#probstats)

### Misc

- [Bayes' theorem in three panels](https://www.tjmahr.com/bayes-theorem-in-three-panels/)
- [Statistics 110: Probability; course at Harvard](https://projects.iq.harvard.edu/stat110)  
- [Bayes theorem and making probability intuitive](https://www.youtube.com/watch?v=HZGCoVF3YvM&feature=youtu.be)

## Session info

Information about this R session including which version of R was used, and what packages were loaded. 

```{r, echo=F}
sessionInfo()
```