diff --git a/_bookdown.yml b/_bookdown.yml
index fc05345..3ac8552 100644
--- a/_bookdown.yml
+++ b/_bookdown.yml
@@ -7,7 +7,8 @@ output_dir: published
 rmd_files: [
   "index.Rmd",
   "setup.Rmd",
-  "univariate-distributions.Rmd",
+  "discrete-distributions.Rmd",
+  "continuous-distributions.Rmd",
   "joint-distributions.Rmd",
   "clt.Rmd",
   "data-analysis.Rmd",
diff --git a/continuous-distributions.Rmd b/continuous-distributions.Rmd
new file mode 100644
index 0000000..4486280
--- /dev/null
+++ b/continuous-distributions.Rmd
@@ -0,0 +1,205 @@
+# Continuous Probability Distributions
+
+## Learning Objectives {-#objectives-continuous-distributions}
+
+1. Define and explain the key characteristics of the continuous distributions:
+    - normal,
+    - lognormal,
+    - exponential,
+    - gamma,
+    - chi-square,
+    - $t$,
+    - $F$,
+    - beta and
+    - uniform on an interval.
+2. Evaluate probabilities and quantiles associated with such distributions.
+3. Generate discrete and continuous random variables using the inverse transform method.
+
+## Theory {-#theory-continuous-distributions}
+
+**A reminder:** `R` was designed to be used for statistical computing - so it handles **randomness** well! Using `R` we can guarantee reproducibility (and enhance sharability) by using the function `set.seed(seed)`{.R} where `seed` is a single integer value. Using this approach we guarantee the generation of the same sequence of random numbers every time we call this function. Use `?set.seed`{.R} to learn more about this function.
+
+## In-built probability distributions
+
+**A recap:** `R` has in-built functions for probability distributions:
+
+- **d***\* $:=$ **d**ensity ("_PDF_"), *i.e.* $f_X(x)$
+- **p***\* $:=$ **p**robability distribution cumulative function ("_CDF_"), *i.e.* $F_X(x) =\boldsymbol{P}(X \leq x)$
+- **q***\* $:=$ **q**uantile function, *i.e.* return $x$ such that $\boldsymbol{P}(X \leq x) = p$
+- **r***\* $:=$ **r**andom deviates, *i.e.* (*pseudo*) random number generator for a given distribution
+- *Where \* $=$ Normal, uniform, lognormal, Student's $t$, Poisson, binomial, Weibull ... see `?distributions()`{.R} for more information
+
+To give some quick examples (we will further explore these in more detail later in this chapter):
+
+| R Code | Definition |
+|-|-|
+| `rnorm(1)`{.R} | Generates $x_1$ where $X \sim \mathcal{N}(0,\,1)$ |
+| `rnorm(n, mean=10, sd=2)`{.R} | Generates $\{y_1,\,y_2,\,\dots,\,y_n\}$ with $Y \sim \mathcal{N}(10,\,2^2)$ |
+| `runif(3, min=5, max=10)`{.R} | Generates $\{z_1,\,z_2,\,z_3\}$ where $Z \sim \mathcal{U}(5,\,10)$ |
+| `dbinom(4, size=5, prob=0.5)`{.R} | Computes $\boldsymbol{P}(X = 4)$ where $X \sim Bin(5,\,0.5)$ |
+| `pgamma(0.2, shape=2, rate=2)`{.R} | Computes $F_Y(0.2)$ where $Y \sim \mathcal{\Gamma}(2,\,2)$, i.e. $\boldsymbol{P}(Y\leq 0.2)$|
+| `qexp(0.5, rate = 2)`{.R} | Determines smallest value of $z$ for $\boldsymbol{P}(Z \leq z) = 0.5$ where $Z \sim Exp(2)$ |
+
+### Continuous probability distributions covered {-}
+
+We will consider how to interact with the following continuous probability distributions in `R`:
+
+- Normal
+- Lognormal
+- Exponential
+- Gamma
+- $\chi^2$
+- Student's $t$
+- $F$
+- Beta
+- Uniform (on an interval)
+
+For each distribution above we will determine how to calculate:
+
+- A random deviate following the continuous distribution $X$,
+- The probability $P(k_1 \leq X \leq k_2)$ that $X$ falls in the range $[k_1,\,k_2]$, found by integrating the probability density function ("_PDF_") over that range,
+- The cumulative distribution function ("_CDF_"), $P(X \leq k)$, and
+- The quantile function to find $k$ representing the value such that $P(X \leq k) = p$, i.e. the _pth_ percentile.
+
+We will finish off with a plot of the distribution.
+
+## Normal distribution
+
+We start with generating random deviates from the Normal distribution.
+
+If we are interested in the standard normal, `R` helpfully has default argument values for $\mu = 0$ and $\sigma = 1$ so the function call is very concise:
+
+```{r continuous-distributions-normal-1}
+# Generate random deviates
+set.seed(42)
+rnorm(5)
+```
+
+We can also specify our own values of $\mu$ and $\sigma$. With `R` we need to remember that the $\sigma$ argument corresponds to the standard deviation, **not** the variance.
+
+```{r continuous-distributions-normal-2}
+# Generate random deviates
+set.seed(42)
+rnorm(5, 10, 2)
+```
+
+We next look at the cumulative distribution function for a Normal distribution. In `R` we can calculate this using `pnorm(q, mean, sd)`{.R} where:
+
+- _q_ is the quantile of interest,
+- _mean_ is the mean, and
+- _sd_ is the standard deviation.
+
+```{r continuous-distributions-normal-3}
+# Calculate P(X <= 2) for X~N(0,1)
+pnorm(2, 0, 1)
+```
+
+Next we want to find the xth percentile of $X \sim N(\mu, \sigma)$. We use the quantile function, `qnorm(p, mean, sd)`{.R}.
+
+```{r continuous-distributions-normal-4}
+# Find the 99th percentile for X~N(10,2)
+percentile_99 <- qnorm(0.99, 10, 2)
+
+paste0(
+  "The 99th percentile of X~N(10, 2) is ",
+  format(percentile_99, digits = 4),
+  "."
+)
+```
+
+As is customary we finish with a plot of the normal distribution.
+
+```{r continuous-distributions-normal-5}
+# Plot the distribution function of X~Normal(mu =, sigma = )
+```
+
+## Lognormal distribution
+
+```{r continuous-distributions-lognormal-1}
+
+# Plot the distribution function of Y~Lognormal()
+```
+
+## Exponential distribution
+
+```{r continuous-distributions-exponential-1}
+
+# Plot the distribution function of Z~Exp()
+```
+
+## Gamma distribution
+
+```{r continuous-distributions-gamma-1}
+
+# Plot the distribution of X~Gamma()
+```
+
+## $\chi^2$ distribution
+
+```{r continuous-distributions-chi-squared-1}
+
+# Plot the distribution of Y~Chi-Square()
+```
+
+## Student's $t$ distribution
+
+```{r continuous-distributions-t-1}
+
+# Plot the distribution of Z~t()
+```
+
+## $F$ distribution
+
+```{r continuous-distributions-F-1}
+
+# Plot the distribution of X~F()
+```
+
+## Beta distribution
+
+```{r continuous-distributions-Beta-1}
+
+# Plot the distribution of Y~Beta()
+```
+
+## Uniform distribution {#continuous-uniform}
+
+```{r continuous-distributions-uniform-1}
+
+# Plot the distribution of Z~Uniform()
+```
+
+## Inverse transform method
+
+The inverse transform method is a way to generate pseudo-random numbers from any probability distribution.
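+It works because, if $X$ is a continuous random variable with CDF $F_X$ and $U \sim \mathcal{U}(0,\,1)$, then $F^{-1}_X(U)$ has exactly the distribution of $X$. A quick empirical sketch of the idea in reverse (the choice of the standard normal here is purely illustrative): applying the CDF to draws from the distribution should give values that look uniform on $[0,\,1]$.
+
+```{r continuous-distributions-inverse-transform-0}
+# Apply the N(0, 1) CDF to N(0, 1) draws - the histogram should look roughly flat
+set.seed(42)
+u_check <- pnorm(rnorm(10000))
+hist(u_check, breaks = 20, main = "F_X(X) is approximately uniform on [0, 1]")
+```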
+
+One possible algorithm is as follows:
+
+1. Generate a random number $u$ from $U \sim \mathcal{U}(0, 1)$
+2. Find the inverse of the desired cumulative distribution function, $F^{-1}_X(x)$
+3. Compute $X = F^{-1}_X(u)$
+
+Suppose we wanted to draw 10,000 random numbers from $X \sim Exp(\lambda = 2)$. In order to use the inverse transform method we first need to find the inverse of the CDF. For any $X \sim Exp(\lambda)$, the inverse of the CDF is $F^{-1}_X(u) = \frac{-\log(1-u)}{\lambda}$. We can thus use the inverse transform algorithm to generate random deviates following $X \sim Exp(2)$:
+
+```{r continuous-distributions-inverse-transform-1}
+# Step 0 - to guarantee reproducibility
+set.seed(42)
+
+# Step 1 - generate 10,000 random deviates from U[0,1]
+u <- runif(10000)
+
+# Step 2 - find the inverse of the CDF: 1 - exp(-lambda.x)
+# Inverse of CDF = -log(1 - x) / lambda
+
+# Step 3 - compute X using the inverse of the CDF [from step 2] and the random deviates u [from step 1]
+x <- -log(1 - u) / 2
+
+# Plot the resulting x deviates
+library(ggplot2)
+df <- data.frame(x = x)
+ggplot(df, aes(x=x)) +
+  geom_histogram(binwidth = 0.5, colour="black", fill="white")
+```
+
+## `R` Practice {-#practice-continuous-distributions}
+
+We finish with a comprehensive example of a univariate continuous distribution question in `R`.
\ No newline at end of file
diff --git a/univariate-distributions.Rmd b/discrete-distributions.Rmd
similarity index 59%
rename from univariate-distributions.Rmd
rename to discrete-distributions.Rmd
index ce30f9c..28f1764 100644
--- a/univariate-distributions.Rmd
+++ b/discrete-distributions.Rmd
@@ -1,6 +1,6 @@
-# Univarite Probability Distributions
+# Discrete Probability Distributions
 
-## Learning Objectives {-#objectives-univariate-distributions}
+## Learning Objectives {-#objectives-discrete-distributions}
 
 1. Define and explain the key characteristics of the discrete distributions:
     - geometric,
@@ -9,21 +9,9 @@
     - hypergeometric,
     - Poisson and
     - uniform on a finite set.
-2. Define and explain the key characteristics of the continous distributions:
-    - normal,
-    - lognormal,
-    - exponential,
-    - gamma,
-    - chi-square,
-    - $t$,
-    - $F$,
-    - beta and
-    - uniform on an interval.
-3. Evaluate probabilities and quantiles associated with distributions.
-4. Define and explain the key characteristics of the Poisson process and explain the connection between the Poisson process and the Poisson distribution.
-5. Generate discrete and continous random variables using the inverse transform method.
-
-## Theory {-#theory-univariate-distributions}
+2. Evaluate probabilities and quantiles associated with such distributions.
+
+## Theory {-#theory-discrete-distributions}
 
 `R` was designed to be used for statistical computing - so it handles **randomness** well!
@@ -31,7 +19,7 @@ Using `R` we can guarantee reproducibility (and enhance sharability) by using th Let's see `set.seed` in action: -```{r univariate-distributions-randomness} +```{r discrete-distributions-randomness} # We make five random draws from the integer range [1, 10] # We cannot guarantee reproducing this outcome when sharing the code: sample(1:10, 5) @@ -55,13 +43,13 @@ sample(1:10, 5) `R` has in-built functions for probability distributions: -- **d***\* $:=$ **d**ensity (PDF), *i.e.* $f_X(x)$ -- **p***\* $:=$ **p**robability distribution cumulative function (CDF), *i.e.* $F_X(x) =\boldsymbol{P}(X \leq x)$ +- **d***\* $:=$ **d**ensity ("_PDF_"), *i.e.* $f_X(x)$ +- **p***\* $:=$ **p**robability distribution cumulative function ("_CDF_"), *i.e.* $F_X(x) =\boldsymbol{P}(X \leq x)$ - **q***\* $:=$ **q**uantile function, *i.e.* return $x$ such that $\boldsymbol{P}(X \leq x) = p$ - **r***\* $:=$ **r**andom deviates, *i.e.* (*psuedo*) random number generator for a given distribution - *Where \* $=$ Normal, uniform, lognormal, Student's $t$, Poisson, binormal, Weibull ... see `?distributions()`{.R} for more information -To give some quick examples (we will explore these in more detail later in this chapter): +To give some quick examples (we will explore these in more detail later in this chapter and the next chapter): | R Code | Definition | |-|-| @@ -72,7 +60,7 @@ To give some quick examples (we will explore these in more detail later in this | `pgamma(0.2, shape=2, rate=2)`{.R} | Computes $F_Y(0.2)$ where $Y \sim \mathcal{\Gamma}(2,\,2)$, i.e. $\boldsymbol{P}(Y\leq 0.2)$| | `qexp(0.5, rate = 2)`{.R} | Determines smallest value of $z$ for $\boldsymbol{P}(Z \leq z) = 0.5$ where $Z \sim Exp(2)$ | -## Discrete probability distributions +### Discrete probability distributions covered {-} We will consider how to interact with the following discrete probability distributions in `R`: @@ -83,7 +71,17 @@ We will consider how to interact with the following discrete probability distrib - Poisson - Uniform (on a finite set) -### Geometric distribution +For each distribution above we will determine how to calculate: + +- A random deviate following the discrete distribution $X$, +- The probability mass function ("_PMF_"), $P(X = k)$ for distribution $X$ and $-\infty < k < \infty$ (noting the _PMF_ $= 0$ for most values of $k$), +- The cumulative distribution function ("_CDF_"), $P(X \leq k)$, +- A range of PMFs, i.e. $P(k_1 \leq X \leq k_2)$, and +- The quantile function to find $k$ representing the value such that $P(X \leq k) = p$, i.e. the _pth_ percentile. + +We will finish off with a plot of the distribution. + +## Geometric distribution The geometric probability distribution ($X$) can be defined by the number of failures ($k$) in Bernoulli trials before achieving a success. A Bernoulli trial is a random experiment with exactly two outcomes, \{"_success_", "_failure_"\}, in which the probability of success is constant in every experiment. More formally, if we have $k \in \{0,\,1,\,2,\, \dots \}$ independent, identically distributed ("_i.i.d._") Bernoulli trials before a "_success_" with the $P($"_success_"$)$ on each trial defined as $p$, then: @@ -96,7 +94,7 @@ In `R` we can generate random deviates following a geometric distribution using - _n_ is the _number_ of random deviates we want to generate, and - _prob_ is the _probability_ of success in any given Bernoulli trial. 
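+Note that the `*geom()` family in `R` follows the "number of failures before the first success" convention used above, **not** the "number of trials" convention some texts use. A quick sketch of what that means in practice (the success probability of 0.25 simply mirrors the example below):
+
+```{r discrete-distributions-geometric-convention}
+# Under R's convention, P(X = 0) - no failures before the first success -
+# is just the success probability itself
+dgeom(0, 0.25)
+```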
-```{r univariate-distributions-geometric-1} +```{r discrete-distributions-geometric-1} # Guarantee reproducibility set.seed(42) @@ -104,9 +102,9 @@ set.seed(42) rgeom(5, 0.25) ``` -If we wanted to find the xth percentile of $X \sim Geo(p)$ we would use the quantile function, `qgeom(p, prob)`{.R}. It is also worth noting that `qgeom()`{.R} has an optional argument `lower.tail` taking Boolean inputs (_TRUE_, _FALSE_) with the default being _TRUE_. We can use this to find the top `p` percentile with `qgeom(p, prob, lower.tail = FALSE)`{.R}. Alternatively we can just plug in 1 - `p` with the default argument value. +If we wanted to find the xth percentile of $X \sim Geo(p)$ we would use the quantile function, `qgeom(p, prob)`{.R}. -```{r univariate-distributions-geometric-2} +```{r discrete-distributions-geometric-2} # Find the 99th percentile of X~Geo(0.25) percentile_99 <- qgeom(0.99, 0.25) @@ -122,7 +120,7 @@ If we wish to find $P(X \leq k)$ then we need to use the function for the cumula - _q_ equates to $k$ failures before success in a sequence of i.i.d. Bernoulli trials, - _prob_ is the _probability_ of success in any given Bernoulli trial. -```{r univariate-distributions-geometric-3} +```{r discrete-distributions-geometric-3} # Find P(X <= 7) with X~Geo(0.25) prob_geom <- pgeom(7, 0.25) @@ -135,7 +133,7 @@ paste0( We can find $P(X = k)$ using the `dgeom()`{.R} function call: -```{r univariate-distributions-geometric-4} +```{r discrete-distributions-geometric-4} # Find P(X = 7) with X~Geo(0.25) prob_geom <- dgeom(7, 0.25) ``` @@ -143,14 +141,14 @@ We find that $P(X = 7) =$ `r toString(prob_geom)`. It helps to visualise the probability distribution. We _can_ do this in base `R` with the `plot()`{.R} function. However for illustrative purposes we will use the `ggplot2` package from the [`tidyverse`](https://www.tidyverse.org/packages/) suite of packages. `ggplot2` allows for rich visualisations^[For inspiration consider the BBC's [cookbook](https://bbc.github.io/rcookbook/) that highlights how to create their distinctive graphics predominately using `ggplot2`.] using a consistent syntax. -```{r univariate-distributions-geometric-5} +```{r discrete-distributions-geometric-5} # Plot the distribution function of X~Geo(0.25) library(ggplot2) df <- data.frame(x = rgeom(1000, 0.25)) ggplot(df, aes(x=x)) + geom_histogram(binwidth = 0.5) ``` -### Binomial distribution +## Binomial distribution The binomial probability distribution ($Y$) can be defined by the number of successes in a sequence of $n$ i.i.d. Bernoulli trials. A Bernoulli trial is a random experiment with exactly two outcomes, \{"_success_", "_failure_"\}, wherein the probability of success is $p$. We often state the probability of failure as $q = 1 - p$ for convenience. More formally, if we have $k \in \{0,\,1,\,2,\, \dots,\, n \}$ successes given $n$ i.i.d. Bernoulli trials with the $P($"_success_"$)$ on each trial defined as $p$, then: @@ -168,13 +166,13 @@ We can generate random deviates following the binomial distribution $Y \sim Bin( - _size_ is the number of i.i.d. Bernoulli trials performed, and - _prob_ is the probability of success in any given trial. 
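+Since a $Bin(n,\,p)$ deviate is simply the number of successes in $n$ i.i.d. Bernoulli trials, we could equivalently build one by summing $n$ draws from $Bin(1,\,p)$. A small sketch (the values 10 and 0.55 just mirror the example below):
+
+```{r discrete-distributions-binomial-bernoulli}
+# One Bin(10, 0.55) deviate assembled from 10 Bernoulli draws
+set.seed(42)
+sum(rbinom(10, 1, 0.55))
+```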
-```{r univariate-distributions-binomial-1} +```{r discrete-distributions-binomial-1} # Generate 5 random deviates on Y~Bin(10, 0.55) set.seed(42) rbinom(5, 10, 0.55) ``` -To find $P(X <= k)$ we can use the function `pbinom(p, size, prob)`{.R} where: +To find $P(Y \leq k)$ we can use the function `pbinom(p, size, prob)`{.R} where: - _p_ is the probability of interest, - _size_ is the number of i.i.d. Bernoulli trials performed, and @@ -182,7 +180,7 @@ To find $P(X <= k)$ we can use the function `pbinom(p, size, prob)`{.R} where: Suppose we were told the probability of having a boy at birth was 0.5131451^[This probability was chosen based on _Table 1_ from [Sex ratios at birth in the United Kingdom, 2014-18](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/925066/Sex_ratios_at_birth_in_the_United_Kingdom__2014-18.pdf) from the Department of Health & Social Care.]. Additionally, suppose we were told that there were 1,264 births^[This turns out to be the number of live births in Guildford in 2019, as per _Table 3_ of the ONS' [Birth Summary Tables, England and Wales, 2019](https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/birthsummarytables). Source: Office for National Statistics under the Open Government Licence.]. Given this, we are curious to know what is the probability of there being less than 632 boys (_i.e. < 50%_) amongst the new babies. -```{r univariate-distributions-binomial-2} +```{r discrete-distributions-binomial-2} # Find P(Y <= 632) with Y~Binom(1264, 0.5131451) prob_binom <- pbinom(632, 1264, 0.5131451) @@ -201,7 +199,7 @@ To find $P(Y = k)$ we can use the function `dbinom(x, size, prob)`{.R} where: Throwing a fair coin, we are curious to compute the probability of observing 4 heads in a sequence of 5 tosses: -```{r univariate-distributions-binomial-3} +```{r discrete-distributions-binomial-3} # Find P(Y = 4) where Y~Binom(5, 0.5) prob_binom <- dbinom(4, 5, 0.5) @@ -216,7 +214,7 @@ If we wanted to find $P(k_1 \leq Y \leq k_2)$ we can also use the function `dbin Continuing the fair coin example, we are curious what is the probability of observing at least 1 head and at most 3 heads in a sequence of 5 tosses: -```{r univariate-distributions-binomial-4} +```{r discrete-distributions-binomial-4} # Find P(1 <= Y <= 3) where Y~Binom(5, 0.5) prob_binom <- sum(dbinom(1:3, 5, 0.5)) @@ -235,7 +233,7 @@ When we are interested in finding the _yth_ percentile we can use the quantile f Continuing the baby boys example, we now are curious what is the 99th percentile of baby births being boys. Recall that the probability (_p_) of having a boy at birth was 0.5131451 and that there were 1,264 births (_n_) in Guildford in 2019. -```{r univariate-distributions-binomial-5} +```{r discrete-distributions-binomial-5} # Find the 99th percentile of Y~Binom(1264, 0.5131451) percentile_99 <- qbinom(0.99, 1264, 0.5131451) @@ -248,11 +246,12 @@ paste0( Finally, we finish with a plot of binomial distributions and note that with large $n$ the shape of the distribution begins to form the normal distribution (as long as $p$ is not near the extremes of 0 or 1). 
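+As a quick numerical sketch of that approximation (the parameters below are purely illustrative), we can compare the binomial CDF with a normal CDF that has matching mean $np$ and variance $np(1-p)$, applying a continuity correction:
+
+```{r discrete-distributions-binomial-normal-approx}
+# P(Y <= 520) for Y~Bin(1000, 0.5) versus its normal approximation
+pbinom(520, 1000, 0.5)
+pnorm(520 + 0.5, mean = 1000 * 0.5, sd = sqrt(1000 * 0.5 * 0.5))
+```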
-```{r univariate-distributions-binomial-6} +```{r discrete-distributions-binomial-6} # Plot the distribution function of X, Y, Z with: # X,Y,Z~Binom(n, p) for various {n, p} library(ggplot2) set.seed(42) + df <- data.frame( dist_types = factor( rep(c( @@ -275,10 +274,10 @@ ggplot(df, aes(x = dist_values, fill = dist_types)) + ylab("Count") + theme_minimal() + labs(fill = "") + - ggtitle("Histogram of binomial distribution for various n, p") + ggtitle("Histogram of binomial distributions for various n, p") ``` -### Negative binomial distribution +## Negative binomial distribution The negative binomial distribution ($Z$) can be defined by the number of failures in a sequence of $n$ independent, identically distributed ("_i.i.d._") Bernoulli trials before a specified number of successes, $r$, occurs. More formally, if we have, $r >0$ number of successes, $k$ number of failures, $p$ probability of success then: @@ -292,19 +291,19 @@ Let us begin by generate random deviates from the negative binomial distribution - _size_ is the target number of successes from i.i.d. Bernoulli trials, and - _prob_ is the probability of success in any given trial -```{r univariate-distributions-negative-binomial-1} +```{r discrete-distributions-negative-binomial-1} # Generate random deviates from the negative binomial distribution set.seed(42) rnbinom(5, 10, 0.5) ``` -To find $P(Z <= k)$ we can use the function `pnbinom(q, size, prob)`{.R} where: +To find $P(Z \leq k)$ we can use the function `pnbinom(q, size, prob)`{.R} where: - _q_ is the quantile of interest, - _size_ is the target number of successes from i.i.d. Bernoulli trials, - _prob_ is the probability of success in any given trial. -```{r univariate-distributions-negative-binomial-2} +```{r discrete-distributions-negative-binomial-2} # Find P(Z <= 2) with Z~NegBin(5, 0.5) prob_nbinom <- pnbinom(2, 5, 0.5) @@ -321,7 +320,7 @@ To find $P(Z = k)$ we can use the function `dnbinom(x, size, prob)`{.R} where: - _size_ is the target number of successes from i.i.d. Bernoulli trials, and - _prob_ is the probability of success in any given trial. -```{r univariate-distributions-negative-binomial-3} +```{r discrete-distributions-negative-binomial-3} # Find P(Z = 2) where Z~NegBin(5, 0.5) prob_nbinom <- dnbinom(2, 5, 0.5) @@ -334,7 +333,7 @@ paste0( We can also calculate $P(k_1 \leq Z \leq k_2)$ with `dnbinom(x, size, prob)`{.R} by entering _x_ as a vector input and summing over the output. -```{r univariate-distributions-negative-binomial-4} +```{r discrete-distributions-negative-binomial-4} # Find P(1 <= Z <= 3) with Z~NegBin(5, 0.5) prob_nbinom <- sum(dnbinom(1:3, 5, 0.5)) @@ -351,7 +350,7 @@ When we are interested in finding the _zth_ percentile we can make use of the qu - _size_ is the target number of successes from i.i.d. Bernoulli trials, and - _prob_ is the probability of success in any given trial. 
-```{r univariate-distributions-negative-binomial-5} +```{r discrete-distributions-negative-binomial-5} # Find the 99th percentile of Z~NegBin(100, 0.5) percentile_99 <- qnbinom(0.99, 100, 0.5) @@ -364,11 +363,12 @@ paste0( Finally, we finish with a plot of the negative binomial distribution: -```{r univariate-distributions-negative-binomial-6} +```{r discrete-distributions-negative-binomial-6} # Plot the distribution function of X, Y, Z with: # X,Y,Z~NegBin(r, p) for various {r, p} library(ggplot2) set.seed(42) + df <- data.frame( dist_types = factor( rep(c( @@ -391,42 +391,42 @@ ggplot(df, aes(x = dist_values, fill = dist_types)) + ylab("Count") + theme_minimal() + labs(fill = "") + - ggtitle("Histogram of negative binomial distribution for various r, p") + ggtitle("Histogram of negative binomial distributions for various r, p") ``` -### Hypergeometric distribution +## Hypergeometric distribution -The hypergeometric distribution ($X$) can be defined by the probability of $k$ successes in $n$ draws **without** replacement from a finite population $N$ that contains exactly $K$ objects. +The hypergeometric distribution ($X$) can be defined by the probability of $k$ successes in $n$ draws **without** replacement from a finite population $N$ that contains exactly $K$ "_success_" objects/states. $$P(X = k) = \frac{ \binom{K}{k} \binom{N - K}{n - k} }{ \binom{N}{n} }$$ We begin by generating random deviates from the hypergeometric distribution, $X \sim Hyper(N, K, n)$, using `rhyper(nn, m, n, k)`{.R} where: - _nn_ is _number_ of observations we want to generate, -- _m_ is number of "success" objects in the population $N$, -- _n_ is number of "failure" objects in the population $N$, and -- _k_ is the number of objects drawn +- _m_ is number of "success" objects/states in the population $N$, +- _n_ is number of "failure" objects/states in the population $N$, and +- _k_ is the number of objects drawn / states observed -Traditionally we have defined the population $N$ as an urn containing _m_ white balls and _n_ black balls and this is how arguments are specified in the `stats`{.R} package. +Traditionally we have defined the population $N$ as an urn containing _m_ white balls and _n_ black balls and this is how the distribution arguments are specified in the `stats`{.R} package. Note that _m_ $= K$ and _n_ $= N - K$. -```{r univariate-distributions-hypergeometric-1} -# Generate random deviates from the hypergeometric distribution +```{r discrete-distributions-hypergeometric-1} +# Generate random deviates on X~Hyper(N = 100, K = 90, n = 10) set.seed(42) -rhyper(5, 90, 10, 10) +rhyper(5, 90, 100 - 90, 10) ``` -In the above random deviate generation we specified a 90% ($frac{90}{90 + 10}$) chance of drawing a "success" object and we made 10 draws from this population of 100 objects. With such a high proportion of "_success_" objects it was not suprising to note we generated random deviates from this distribution close to 10/10. +In the above random deviate generation we specified a 90% ($\frac{90}{90 + 10}$) chance of drawing a "success" object and we made 10 draws from this population of 100 objects. With such a high proportion of "_success_" objects it was not suprising to note we generated random deviates from this distribution close to 10/10. 
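+As a quick sanity check of that intuition (a sketch using the same urn), the average of many such deviates should sit close to the theoretical mean $n \times K / N = 10 \times 90 / 100 = 9$:
+
+```{r discrete-distributions-hypergeometric-mean}
+# Empirical mean of 10,000 deviates from Hyper(N = 100, K = 90, n = 10)
+set.seed(42)
+mean(rhyper(10000, 90, 100 - 90, 10))
+```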
-Next we wish to find $P(X <= k)$ which involves using `phyper(q, m, n, k)`{.R} where: +Next we wish to find $P(X \leq k)$ which involves using `phyper(q, m, n, k)`{.R} where: - _q_ is the quantile of interest (representing the number of "success" objects in the finite population $N$), - _m_ is the number of "success" objects in the population $N$, - _n_ is number of "failure" objects in the population $N$, and - _k_ is the number of objects drawn -```{r univariate-distributions-hypergeometric-2} -# Find P(X <= 5) with X~Hyper() -prob_hyper <- phyper(5, 90, 10, 10) +```{r discrete-distributions-hypergeometric-2} +# Find P(X <= 5) with X~Hyper(N = 100, K = 90, n = 10) +prob_hyper <- phyper(5, 90, 100 - 90, 10) paste0( "With 10 objects drawn from a total 100 possible, ", @@ -443,17 +443,16 @@ To find $P(X = k)$ we can use the function `dhyper(x, m, n, k)`{.R} where: - _n_ is number of "failure" objects in the population $N$, and - _k_ is the number of objects drawn -```{r univariate-distributions-hypergeometric-3} -# Find P(X = 5) with X~Hyper() -prob_hyper <- dhyper(5, 90, 10, 10) +```{r discrete-distributions-hypergeometric-3} +# Find P(X = 5) with X~Hyper(N = 100, K = 90, n = 10) +prob_hyper <- dhyper(5, 90, 100 - 90, 10) ``` We can also calculate $P(k_1 \leq X \leq k_2)$ with `dhyper(x, m, n, k)`{.R} by entering _x_ as a vector input and summing over the output. - -```{r univariate-distributions-hypergeometric-4} -# Find P(7 <= X <= 9) with X~Hyper() -prob_hyper <- sum(dhyper(7:9, 90, 10, 10)) +```{r discrete-distributions-hypergeometric-4} +# Find P(7 <= X <= 9) with X~Hyper(N = 100, K = 90, n = 10) +prob_hyper <- sum(dhyper(7:9, 90, 100 - 90, 10)) ``` When we are interested in finding the _xth_ percentile we can make use of the quantile function, `qhyper(p, m, n, k)`{.R} where: @@ -463,9 +462,9 @@ When we are interested in finding the _xth_ percentile we can make use of the qu - _n_ is number of "failure" objects in the population $N$, and - _k_ is the number of objects drawn -```{r univariate-distributions-hypergeometric-5} -# Find the 99th percentile of X~Hyper() -percentile_99 <- qhyper(0.99, 90, 10, 50) +```{r discrete-distributions-hypergeometric-5} +# Find the 99th percentile of X~Hyper(N = 100, K = 90, n = 50) +percentile_99 <- qhyper(0.99, 90, 100 - 90, 50) paste0( "When we draw 50 objects from a population of ", @@ -477,203 +476,260 @@ paste0( ``` Finally, we finish with a plot of the hypergeometric distribution: -```{r univariate-distributions-hypergeometric-6} +```{r discrete-distributions-hypergeometric-6} # Plot the distribution function of X, Y, Z with: -# X,Y,Z~Hyper() for various {xx, xx} +# X,Y,Z~Hyper(N, K, n) for various {N, K, n} library(ggplot2) set.seed(42) -``` -### Poisson distribution +inputs <- c( + "N = 500, K = 50, n = 100", + "N = 500, K = 60, n = 200", + "N = 500, K = 70, n = 300" +) -```{r univariate-distributions-poisson-2} -# Find P(Y <= 5) -``` -```{r univariate-distributions-poisson-3} -# Find P(Y = 5) -``` -```{r univariate-distributions-poisson-4} -# Find the 99th percentile of Y~Poisson() -``` -```{r univariate-distributions-poisson-5} -# Plot the distribution function of Y~Poisson() +df <- data.frame( + dist_types = rep(inputs, each = 1000), + dist_values = c( + rhyper(1000, 50, 500 - 50, 100), + rhyper(1000, 60, 500 - 60, 200), + rhyper(1000, 70, 500 - 70, 300) + ) +) + +ggplot(df, aes(x = dist_values, fill = dist_types)) + + geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") + + xlab("Distribution values") + + ylab("Count") + + 
  theme_minimal() +
+  labs(fill = "") +
+  scale_fill_discrete(breaks = inputs) +
+  ggtitle("Histogram of Hypergeometric distributions for various inputs")
```
-### Uniform distribution - discrete
+## Poisson distribution
-Our final discrete probability distribution concerns the uniform distribution defined over a finite set.
+The Poisson distribution ($Y$) is usually defined as the probability of a given number of events occurring ($k$) in a fixed interval of time where events occur with a **constant mean rate** ($\lambda$) and **independently** of the time since the last event. More formally, for $\lambda > 0$ and $k = 0,\,1,\,2,\,\dots\,$:
-As an example we can simulate a die throw using $Z \sim \mathcal{U}(1, 6)
+$$P(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
-```{r univariate-distributions-uniform-discrete-2}
-# Find P(Z <= 3)
-```
+We can generate random deviates from the Poisson distribution $Y \sim Pois(\lambda)$ using `rpois(n, lambda)`{.R} where:
-```{r univariate-distributions-uniform-discrete-3}
-# Find P(Z = 2)
-```
+- _n_ is the _number_ of random deviates we want to generate, and
+- _lambda_ is the constant _mean_ rate.
-```{r univariate-distributions-uniform-discrete-4}
-# Plot the distribution function of Z~Uniform(1, 6)
+```{r discrete-distributions-poisson-1}
+# Generate 5 random deviates on Y~Pois(8)
+set.seed(42)
+rpois(5, 8)
```
-## Continous probability distributions
-
-We will now consider how to interact with the following continous probability distributions in `R`:
-
-- Normal
-- Lognormal
-- Exponential
-- Gamma
-- $\chi^2$
-- Student's $t$
-- $F$
-- Beta
-- Uniform (on an interval)
-
-### Normal distribution
+To find $P(Y \leq k)$ we can use the function `ppois(q, lambda)`{.R} where:
-We start with generating random deviates from the Normal distribution.
+- _q_ is the quantile of interest, and
+- _lambda_ is the constant _mean_ rate.
-If we are interested in the standard normal, `R` helpfully has default argument values for $\mu = 0$ and $\sigma = 0$ so the function call is very concise:
+```{r discrete-distributions-poisson-2}
+# Find P(Y <= 5) with Y~Pois(8)
+prob_pois <- ppois(5, 8)
-```{r univariate-distributions-normal-1}
-# Generate random deviates
-set.seed(42)
-rnorm(5)
+paste0(
+  "P(Y <= 5) with Y~Pois(8) is ",
+  format(prob_pois, digits = 4),
+  "."
+)
```
-We can also specify our own values of $\mu$ and $\sigma$. With `R` we need to remember that the $\sigma$ argument corresponds to the standard deviation, **not** the variance.
+To find $P(Y = k)$ we can use the function `dpois(x, lambda)`{.R} where:
-```{r univariate-distributions-normal-2}
-# Generate random deviates
-set.seed(42)
-rnorm(5, 10, 2)
-```
+- _x_ is the quantile of interest, and
+- _lambda_ is the constant _mean_ rate.
-We next look at the cumulative distribution function for a Normal distribution. In `R` we can calculate this using `pnorm(q, mean, sd)`{.R} where:
+```{r discrete-distributions-poisson-3}
+# Find P(Y = 5) with Y~Pois(8)
+prob_pois <- dpois(5, 8)
-- _q_ is the quantile of interest,
-- _mean_ is the mean, and
-- _sd_ is standard deviation.
-
-```{r univariate-distributions-normal-3}
-# Calculate P(X <= 2) for X~N(0,1)
+paste0(
+  "P(Y = 5) with Y~Pois(8) is ",
+  format(prob_pois, digits = 4),
+  "."
+)
```
-Next we want to find the xth percentile of $X \sim N(\mu, \sigma)$. We use the quantile function, `qnorm(mu, sigma)`{.R}.
+If we wanted to find $P(k_1 \leq Y \leq k_2)$ we can also use the function `dpois(x, lambda)`{.R} with _x_ entered as a vector and summing over the output.
-```{r univariate-distributions-normal-4}
-# Find the 99th percentile for X~N(10,2)
-percentile_99 <- qnorm(0.99, 10, 2)
+```{r discrete-distributions-poisson-4}
+# Find P(4 <= Y <= 6) where Y~Pois(8)
+prob_pois <- sum(dpois(4:6, 8))
 
 paste0(
-  "The 99th percentile of X~N(10, 2) is ",
-  format(percentile_99, digits = 4),
+  "P(4 <= Y <= 6) with Y~Pois(8) is ",
+  format(prob_pois, digits = 4),
   "."
 )
 ```
-As is customary we finish with a plot of the normal distribution.
+When we are interested in finding the _yth_ percentile we can use the quantile function, `qpois(p, lambda)`{.R} where:
-```{r univariate-distributions-normal-5}
-# Plot the distribution function of X~Normal(mu =, sigma = )
-```
+- _p_ is the _percentile_ of interest, and
+- _lambda_ is the constant _mean_ rate.
-### Lognormal distribution
+```{r discrete-distributions-poisson-5}
+# Find the 99th percentile of Y~Poisson(8)
+percentile_99 <- qpois(0.99, 8)
-```{r univariate-distributions-lognormal-1}
+paste0(
+  "The 99th percentile of Y~Pois(8) is ",
+  percentile_99,
+  "."
+)
-# Plot the distribution function of Y~Lognormal()
 ```
-### Exponential distribution
+Finally, we finish with a plot of Poisson distributions. We note that for large $\lambda$ the Poisson distribution is well approximated by the normal distribution with $\mu = \lambda$ and $\sigma^2 = \lambda$. We can approximate the Poisson distribution with this normal distribution after allowing for a continuity correction:
-```{r univariate-distributions-exponential-1}
+$$F_{Pois}(x; \lambda) \approx F_{normal}(x + 0.5; \mu = \lambda, \sigma^2 = \lambda)$$
-# Plot the distribution function of Z~Exp()
-```
+Here we have replaced $P(X \leq x)$ with $P(X \leq x + 0.5)$ on account of the continuity correction.
-### Gamma distribution
-
-```{r univariate-distributions-gamma-1}
-
-# Plot the distribution of X~Gamma()
-```
+```{r discrete-distributions-poisson-6}
+# Plot the distribution function of X, Y, Z with:
+# X,Y,Z~Pois(lambda) for various lambda
+library(ggplot2)
+set.seed(42)
-### $\chi^2$ distribution
+lambdas <- c(
+  "lambda = 8",
+  "lambda = 16",
+  "lambda = 32"
+)
-```{r univariate-distributions-chi-squared-1}
+df <- data.frame(
+  dist_types = rep(lambdas, each = 1000),
+  dist_values = c(
+    rpois(1000, 8),
+    rpois(1000, 16),
+    rpois(1000, 32)
+  )
+)
-# Plot the distribution of Y~Chi-Square()
+ggplot(df, aes(x = dist_values, fill = dist_types)) +
+  geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
+  xlab("Distribution values") +
+  ylab("Count") +
+  theme_minimal() +
+  labs(fill = "") +
+  scale_fill_discrete(breaks = lambdas) +
+  ggtitle("Histogram of Poisson distributions for various lambda")
 ```
-### Student's $t$ distribution
+## Uniform distribution {#discrete-uniform}
-```{r univariate-distributions-t-1}
+Our final discrete probability distribution concerns the uniform distribution ($Z$) defined over the finite set $\{a,\,a+1,\,\dots,\,b\}$. This is one discrete probability distribution which is **not** part of the source code for the `stats` package in base `R`. We can use the `rdunif(n, b, a)` function within the `purrr` package^[[`purrr`](https://purrr.tidyverse.org/) is a part of the tidyverse suite of packages] to generate random deviates following the discrete uniform distribution.
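+If you would rather stay in base `R`, `sample()` with `replace = TRUE` draws from a finite set with equal probability and can play the same role - a small sketch, equivalent in distribution to `rdunif()`:
+
+```{r discrete-distributions-uniform-sample}
+# Five draws from the discrete uniform on {1, ..., 6} using base R only
+set.seed(42)
+sample(1:6, 5, replace = TRUE)
+```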
-# Plot the distribution of Z~t()
-```
+We have:
-### $F$ distribution
+$$P(Z = k) = \frac{1}{b - a + 1}$$
-```{r univariate-distributions-F-1}
+
-# Plot the distribution of X~F()
+As an example we can simulate a die throw using $Z \sim \mathcal{U}(1, 6)$ as follows:
-```
+```{r discrete-distributions-uniform-1}
+# Generate 5 random deviates on Z~Unif(1, 6)
+set.seed(42)
+purrr::rdunif(5, 6, 1)
 ```
-### Beta distribution
+To find $P(Z \leq k)$ we can calculate this using the CDF:
-```{r univariate-distributions-Beta-1}
+$$P(Z \leq k) = \frac{\lfloor k \rfloor - a + 1}{b - a + 1}$$
-# Plot the distribution of Y~Beta()
+```{r discrete-distributions-uniform-2}
+# Find P(Z <= 2) with Z~Unif(1, 6)
+prob_uniform <- (2 - 1 + 1) / (6 - 1 + 1)
+
+paste0(
+  "P(Z <= 2) with Z~Unif(1, 6) is ",
+  format(prob_uniform, digits = 4),
+  "."
+)
 ```
-### Uniform distribution - continous
+To find $P(Z = k)$ we can use the PMF, $P(Z = k) = \frac{1}{b - a + 1}$:
-```{r univariate-distributions-uniform-continous-1}
+```{r discrete-distributions-uniform-3}
+# Find P(Z = 2) with Z~Unif(1, 6)
+prob_uniform <- 1 / (6 - 1 + 1)
-# Plot the distribution of Z~Uniform()
+paste0(
+  "P(Z = 2) with Z~Unif(1, 6) is ",
+  format(prob_uniform, digits = 4),
+  "."
+)
 ```
-## Poisson process
+To find $P(k_1 \leq Z \leq k_2)$ we can sum over the PMFs:
-```{r univariate-distributions-poisson-process-1}
+```{r discrete-distributions-uniform-4}
+# Find P(3 <= Z <= 5) where Z~Unif(1, 6)
+# P(Z = k) is constant for all k:
+prob_uniform <- 1 / (6 - 1 + 1)
+# We have k = 3, 4, 5:
+prob_uniform <- prob_uniform * 3
+paste0(
+  "P(3 <= Z <= 5) with Z~Unif(1, 6) is ",
+  format(prob_uniform, digits = 4),
+  "."
+)
 ```
-## Inverse transform method
+When we are interested in finding the _zth_ percentile we can use the quantile function $F^{-1}_{Z}(p)$:
-The inverse transform method is a way to generate psuedo-random numbers from any probability distribution.
+$$F^{-1}_{Z}(p) = a + \lceil (b - a + 1)\,p \rceil - 1$$
-One possible algorithm is as follows:
+```{r discrete-distributions-uniform-5}
+# Find the 99th percentile of Z~Unif(1, 6)
+percentile_99 <- 1 + ceiling((6 - 1 + 1) * 0.99) - 1
-1. Generate a random number $u$ from $U \sim \mathcal{U}(0, 1)
-2. Find the inverse of the desired cumulative distribution function, $F^{-1}_X(x)$
-3. Compute $X = F^{-1}_X(u)$
+paste0(
+  "The 99th percentile of Z~Unif(1, 6) is ",
+  percentile_99,
+  "."
+)
-Suppose we wanted to draw 10,000 random numbers from $X \sim Exp(\lambda = 2)$. In order to use the inverse transform method we first need to find the inverse of the CDF. For any $X \sim Exp(\lambda)$, the inverse of the CDF is $\frac{-log(1-x)}{\lambda}$. We can thus use the inverse transform algorithm to generate random deviates following $X \sim Exp(2)$:
+```
+Finally, we finish with a plot of uniform distributions for various inputs.
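+Before the plot, a quick empirical sketch confirming the hand calculations above (using the same die example):
+
+```{r discrete-distributions-uniform-check}
+# Empirical P(Z <= 2) and P(3 <= Z <= 5) from simulated die throws;
+# these should sit close to 2/6 and 3/6 respectively
+set.seed(42)
+throws <- purrr::rdunif(10000, 6, 1)
+mean(throws <= 2)
+mean(throws >= 3 & throws <= 5)
+```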
-```{r univariate-distributions-inverse-transform-1} -# Step 0 - to guarantee reproducibility +```{r discrete-distributions-uniform-6} +# Plot the distribution function of X, Y, Z with: +# X,Y,Z~Unif(a, b) for various {a, b} +library(ggplot2) set.seed(42) -# Step 1 - generate 10,000 random deviates from U[0,1] -u <- runif(10000) - -# Step 2 - find the inverse of the CDF: 1 - exp(-lambda.x) -# Inverse of CDF = -log(1 - x) / lambda +finite_sets <- c( + "a = 1, b = 6", # 1 die + "a = 1, b = 12", # 2 dice + "a = 1, b = 20" # d20 die from D&D +) -# Step 3 - compute X using the inverse of the CDF [from step 2] and the random deviates u [from step 1] -x <- -log(1 - u) / 2 +df <- data.frame( + dist_types = rep(finite_sets, each = 5000), + dist_values = c( + purrr::rdunif(5000, 1 , 6), + purrr::rdunif(5000, 1, 12), + purrr::rdunif(5000, 1, 20) + ) +) -# Plot the resulting x deviates -library(ggplot2) -df <- data.frame(x = x) -ggplot(df, aes(x=x)) + - geom_histogram(binwidth = 0.5, colour="black", fill="white") +ggplot(df, aes(x = dist_values, fill = dist_types)) + + geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") + + xlab("Distribution values") + + ylab("Count") + + theme_minimal() + + labs(fill = "") + + scale_fill_discrete(breaks = finite_sets) + + ggtitle("Histogram of uniform distributions for various inputs") ``` -## `R` Practice {-#practice-univariate-distributions} +## `R` Practice {-#practice-discrete-distributions} -We finish with a comprehensive example of an univariate distribution question in `R`. \ No newline at end of file +We finish with a comprehensive example of an univariate discrete distribution question in `R`. \ No newline at end of file