-
Notifications
You must be signed in to change notification settings - Fork 8
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
3 changed files
with
469 additions
and
133 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,274 @@ | ||
--- | ||
title: 'Probability Distributions' | ||
author: "Nicholas J. Gotelli" | ||
date: "20 February 2024" | ||
output: | ||
html_document: | ||
highlight: tango | ||
theme: united | ||
pdf_document: default | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(eval = FALSE) | ||
``` | ||
|
||
### Probability distributions in R | ||
#### Discrete distributions | ||
|
||
- Poisson | ||
* Range: [0,$\infty$] | ||
* Parameters: size = number of events, rate = $\lambda$ | ||
* Interpretation: Distribution of events that occur during a fixed time interval or sampling effort with a constant rate of independent events; resembles normal with large $\lambda$, or exponential with small $\lambda$ | ||
|
||
- Binomial | ||
* Range: [0, # of trials] | ||
* Parameters: size= number of trials; p = probability of positive outcome | ||
* Interpretation: Distribution of number of successful independent dichotomous trials, with constant p | ||
|
||
- Negative Binomial | ||
* Range: [0, $\infty$] | ||
* Parameters: size=number of successes; p = probability of success | ||
* Interpretation: Distribution of number of failures in a series of independent Bernouli trials, each with p = probability of a success. Generates a discrete distribution that is more heterogeneous ("overdispersed") than Poisson | ||
|
||
#### Continuous distributions | ||
|
||
- Uniform | ||
* Range: [min,max] | ||
* Parameters: min = minimum boundary; max = maximum boundary | ||
* Interpretation: Distribution of a value that is equally likely within a specified range | ||
|
||
- Normal | ||
* Range: [$-\infty,\infty$] | ||
* Parameters: mean = central tendency; SD = standard deviation | ||
* Interpretation: Symmetric bell-shaped curve with unbounded tails | ||
|
||
- Gamma $\Gamma$ | ||
* Range: [0,$\infty$] | ||
* Parameters: shape, scale | ||
* Interpretation: mean=$shape*scale$, variance=$shape*scale^2$; generates a variety of shapes (including normal and exponential) for positive continuous variables | ||
|
||
- Beta $\beta$ | ||
* Range: [0,1] (can be rescaled to any range by simple multiplication and addition) | ||
* Paramters: shape1, shape2 | ||
* Interpretation: if shape1 and shape 2 are integers, interpret as a coin toss, with shape1 = # of successes + 1, shape2 = # of failures + 1. Gives distribution of value of p, estimated from data, which can range from exponential through uniform through normal (but all are bounded). Setting shape1 and shape2 <1 yields u-shaped distributions. | ||
|
||
### The "grammar" of probability distributions in R | ||
- `d` gives probability density function | ||
- `p` gives cumulative distribution function | ||
- `q` gives quantile function (the inverse of `p`) | ||
- `r` gives random number generation | ||
|
||
Combine these with the base name of the function. For example `rbinom` gives a set of random values drawn from a binomial, whereas `dnorm` gives the density function for a normal distribution. There are many probability distributions available in R, but we will discuss only 7 of them. | ||
|
||
#### Poisson distribution | ||
|
||
```{r} | ||
library(ggplot2) | ||
library(MASS) | ||
#------------------------------------------------- | ||
# Poisson distribution | ||
# Discrete X >= 0 | ||
# Random events with a constant rate lambda | ||
# (observations per time or per unit area) | ||
# Parameter lambda > 0 | ||
# "d" function generates probability density | ||
hits <- 0:10 | ||
myVec <- dpois(x=hits,lambda=1) | ||
qplot(x=hits,y=myVec,geom="col",color=I("black"),fill=I("goldenrod")) | ||
myVec <- dpois(x=hits,lambda=2) | ||
qplot(x=hits,y=myVec,geom="col",color=I("black"),fill=I("goldenrod")) | ||
hits <- 0:15 | ||
myVec <- dpois(x=hits,lambda=6) | ||
qplot(x=hits,y=myVec,geom="col",color=I("black"),fill=I("goldenrod")) | ||
hits <- 0:15 | ||
myVec <- dpois(x=hits,lambda=0.2) | ||
qplot(x=hits,y=myVec,geom="col",color=I("black"),fill=I("goldenrod")) | ||
sum(myVec) # sum of density function = 1.0 (total area under curve) | ||
# for a Poisson distribution with lambda=2, | ||
# what is the probability that a single draw will yield X=0? | ||
dpois(x=0,lambda=2) | ||
# "p" function generates cumulative probability density; gives the | ||
# "lower tail" cumulative area of the distribution | ||
hits <- 0:10 | ||
myVec <- ppois(q=hits,lambda=2) | ||
qplot(x=hits,y=myVec,geom="col",color=I("black"),fill=I("goldenrod")) | ||
# for a Poisson distribution with lambda=2, | ||
# what is the probability of getting 1 or fewer hits? | ||
ppois(q=1, lambda=2) | ||
# We could also get this through dpois | ||
p_0 <- dpois(x=0,lambda=2) | ||
p_0 | ||
p_1 <- dpois(x=1,lambda=2) | ||
p_1 | ||
p_0 + p_1 | ||
# The q function is the inverse of p | ||
# What is the number of hits corresponding to 50% of the probability mass | ||
qpois(p=0.5,lambda=2.5) | ||
qplot(x=0:10,y=dpois(x=0:10,lambda=2.5),geom="col",color=I("black"),fill=I("goldenrod")) | ||
# but distribution is discrete, so this is not exact | ||
ppois(q=2,lambda=2.5) | ||
# finally, we can simulate individual values from a poisson | ||
ranPois <- rpois(n=1000,lambda=2.5) | ||
qplot(x=ranPois,color=I("black"),fill=I("goldenrod")) | ||
# for real or simulated data, we can use the quantile | ||
# function to find the empirical 95% confidence interval on the data | ||
quantile(x=ranPois,probs=c(0.025,0.975)) | ||
``` | ||
|
||
#### Binomial distribution | ||
|
||
```{r} | ||
#------------------------------------------------- | ||
# Binomial distribution | ||
# p = probability of a dichotomous outcome | ||
# size = number of trials | ||
# x = possible outcomes | ||
# outcome x is bounded between 0 and number of trials | ||
# use "d" binom for density function | ||
hits <- 0:10 | ||
myVec <- dbinom(x=hits,size=10,prob=0.5) | ||
qplot(x=0:10,y=myVec,geom="col",color=I("black"),fill=I("goldenrod")) | ||
# and how does this compare to an actual simulation of 50 tosses of 100 coins? | ||
myCoins <- rbinom(n=50,size=100,prob=0.5) | ||
qplot(x=myCoins,color=I("black"),fill=I("goldenrod")) | ||
quantile(x=myCoins,probs=c(0.025,0.975)) | ||
``` | ||
|
||
#### Negative Binomial distribution | ||
|
||
```{r} | ||
#------------------------------------------------- | ||
# negative binomial: number of failures (values of MyVec) | ||
# in a series of (Bernouli) with p=probability of success | ||
# before a target number of successes (= size) | ||
# generates a discrete distribution that is more | ||
# heterogeneous ("overdispersed") than Poisson | ||
hits <- 0:40 | ||
myVec <- dnbinom(x=hits, size=5, prob=0.5) | ||
qplot(x=hits,y=myVec,geom="col",color=I("black"),fill=I("goldenrod")) | ||
# geometric series is a special case where N= 1 success | ||
# each bar is a constant fraction 1 - "prob" of the bar before it | ||
myVec <- dnbinom(x=hits, size=1, prob=0.1) | ||
qplot(x=hits,y=myVec,geom="col",color=I("black"),fill=I("goldenrod")) | ||
# alternatively specify mean = mu of distribution and size, | ||
# the dispersion parameter (small is more dispersed) | ||
# this gives us a poisson with a lambda value that varies | ||
# the dispersion parameter is the shape parameter in the gamma | ||
# as it increases, the distribution has a smaller variance | ||
# just simulate it directly | ||
nbiRan <- rnbinom(n=1000,size=10,mu=5) | ||
qplot(nbiRan,color=I("black"),fill=I("goldenrod")) | ||
nbiRan <- rnbinom(n=1000,size=0.1,mu=5) | ||
qplot(nbiRan,color=I("black"),fill=I("goldenrod")) | ||
``` | ||
### estimating paramaters from data | ||
`p(data|hypothesis)` is the standard approach for null hypothesis tests. | ||
|
||
Model paramaters can be thought of as a hypothesis about the distribution of the data: `p(data|paramaters)` | ||
|
||
This would be a goodness of fit test for data to a particular statistical distribution. | ||
|
||
More useful might be `p(paramaters|data)`. For a given set of data, what is the probability that a particular statistical distribution (such as a normal with a specified mean and standard deviation), provides a fit to these data? | ||
|
||
Clearly, we want paramaters that *maximize* this probability. The paramaters that fit this definition are *maximum likelihood estimators*. | ||
|
||
Here is a simple example: | ||
|
||
```{r} | ||
library(MASS) | ||
data <- c(100,100,104,99) | ||
z <- fitdistr(data,"normal") | ||
print(z) | ||
``` | ||
|
||
For this procedure: | ||
|
||
1. Select a model that is appropriate for the data | ||
2. Use `fitdistr` to find maximum likelihood estimators for the paramaters. | ||
3. Use those paramaters to either plot the density function or simulate new data. | ||
|
||
Here are a few data from Lauren Ash's dissertation: snout-vent length (SVL) measurements for 6 specimens of green frogs (*Lithobates clamitans*) collected from Berlin Pond in 2016. | ||
|
||
```{r} | ||
frog_data <- c(30.15,26.3,27.5,22.9,27.8,26.2) | ||
# get and print model parameters assuming a normal distribution | ||
z<- fitdistr(frog_data,"normal") | ||
print(z) | ||
# plot the density function for the normal and annotate the plot with the original data | ||
x <- 1:100 | ||
g_density <- dnorm(x=x,mean=z$estimate["mean"],sd=z$estimate["sd"]) | ||
qplot(x,g_density,geom="line") + annotate(geom="point",x=frog_data,y=0.001) | ||
# now do the same for a gamma distribution, which has a shape and rate parameter as outputs from fitdistr | ||
z<- fitdistr(frog_data,"gamma") | ||
print(z) | ||
# plot the density function for the gamma and annotate the plot with the original data | ||
x <- 1:100 | ||
g_density <- dgamma(x=x,shape=z$estimate["shape"],rate=z$estimate["rate"]) | ||
qplot(x,g_density,geom="line") + annotate(geom="point",x=frog_data,y=0.001,color="red") | ||
``` | ||
|
||
There is not much difference here.But now let's add a weird outlier of a tiny value and see what happens | ||
```{r} | ||
frog_data <- c(30.15,26.3,27.5,22.9,27.8,26.2) | ||
outlier <- 0.050 | ||
frog_data <- c(frog_data,outlier) | ||
# get and print model parameters assuming a normal distribution | ||
z<- fitdistr(frog_data,"normal") | ||
print(z) | ||
# plot the density function for the normal and annotate the plot with the original data | ||
x <- 1:100 | ||
g_density <- dnorm(x=x,mean=z$estimate["mean"],sd=z$estimate["sd"]) | ||
qplot(x,g_density,geom="line") + annotate(geom="point",x=frog_data,y=0.001) | ||
# now do the same for a gamma distribution, which has a shape and rate parameter as outputs from fitdistr | ||
z<- fitdistr(frog_data,"gamma") | ||
print(z) | ||
# plot the density function for the gamma and annotate the plot with the original data | ||
x <- 1:100 | ||
g_density <- dgamma(x=x,shape=z$estimate["shape"],rate=z$estimate["rate"]) | ||
qplot(x,g_density,geom="line") + annotate(geom="point",x=frog_data,y=0.001,color="red") | ||
``` | ||
|
Oops, something went wrong.