# Generalized Linear Models
Generalized linear models are a form of regression in which a linear
function of the predictors is transformed to an appropriate domain and
then paired with an appropriate sampling distribution. For example,
logistic regression involves a logit link function, which maps
values in $(0, 1)$ to $\mathbb{R},$ combined with a Bernoulli sampling
distribution. Traditional linear regression uses the identity
function as a transform and a normal sampling distribution.
## Linear predictors and regression coefficients
In all of the generalized linear models, there will be $N$ observed
outcomes $y = y_1, \ldots, y_N,$ whose range varies from model to
model.
In all of the models, there will be an $N \times K$ data matrix $x$,
where each row vector $x_{n, 1:K} \in \mathbb{R}^K$ consists of predictors
for outcome $y_n.$
In all of the models, there will be parameters for an intercept
$\alpha \in \mathbb{R}$ and regression coefficients $\beta \in
\mathbb{R}^K$. The linear predictor for a generalized linear model
for item $y_n$ is
$$
\alpha + x_n \cdot \beta
= \alpha + \sum_{k = 1}^K x_{n, k} \cdot \beta_k.
$$
Each generalized linear model will then transform this linear
predictor and provide a corresponding sampling distribution, which may
include additional parameters.
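For concreteness, here is a minimal R sketch of the linear predictor,
evaluated for all $N$ observations at once; the function and variable
names are illustrative rather than taken from any package.
```{r}
# Evaluate the linear predictor alpha + x_n . beta for every row of x.
linear_predictor <- function(alpha, beta, x) {
  as.vector(alpha + x %*% beta)  # length-N vector
}

# Tiny example with N = 3 observations and K = 2 predictors
x <- matrix(c( 1.0, -0.5,
               0.3,  2.0,
              -1.2,  0.7), nrow = 3, byrow = TRUE)
linear_predictor(alpha = 0.5, beta = c(1, -2), x)
```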
## Logistic regression
Logistic regression involves binary data $y_n \in \{ 0, 1 \},$ a
logit link function (hence the name), and a Bernoulli sampling
distribution.
The logit function $\textrm{logit}:(0, 1) \rightarrow (-\infty,
\infty)$ maps a probability to its log odds, that is, the logarithm of
the odds it represents,
$$
\textrm{logit}(p) = \log \frac{p}{1 - p}.
$$
The inverse of the logit link function, $\textrm{logit}^{-1}:(-\infty,
\infty) \rightarrow (0, 1),$ maps real numbers back to probabilities
by
$$
\textrm{logit}^{-1}(a) = \frac{1}{1 + \exp(-a)}.
$$
The inverse logit function is a particular form of S-shaped, or
sigmoid, function.^[Other popular sigmoid functions in statistical
applications include the hyperbolic tangent function
$\textrm{tanh}:\mathbb{R} \rightarrow (-1, 1)$ and the
standard normal cumulative distribution function,
$\Phi:\mathbb{R} \rightarrow (0, 1).$]
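In R, the logit and inverse logit are available as the quantile and
cumulative distribution functions of the standard logistic
distribution, so a direct encoding of the two definitions above looks
like this.
```{r}
logit <- function(p) qlogis(p)      # log(p / (1 - p))
inv_logit <- function(a) plogis(a)  # 1 / (1 + exp(-a))

logit(0.75)             # log(3), approximately 1.099
inv_logit(logit(0.75))  # recovers 0.75
```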
The Bernoulli sampling distribution is defined for $u \in \{ 0, 1 \}$
and $\theta \in (0, 1)$ by
$$
\textrm{bernoulli}(u \mid \theta)
=
\begin{cases}
\theta & \textrm{if} \ u = 1
\\[4pt]
1 - \theta & \textrm{if} \ u = 0.
\end{cases}
$$
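In R, the Bernoulli pmf can be evaluated as the binomial pmf with size
one; a minimal sketch:
```{r}
bernoulli <- function(u, theta) dbinom(u, size = 1, prob = theta)

bernoulli(1, 0.3)  # theta = 0.3
bernoulli(0, 0.3)  # 1 - theta = 0.7
```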
The logistic regression probability mass function is defined for
$y \in \{ 0, 1 \}^N,$
$x \in \mathbb{R}^{N \times K},$ $\alpha \in \mathbb{R},$
and $\beta \in \mathbb{R}^K$ by
$$
p(y \mid x, \alpha, \beta)
= \prod_{n = 1}^N
\textrm{bernoulli}(y_n \mid
\textrm{logit}^{-1}(\alpha + x_n \cdot \beta)
).
$$
To avoid underflow to zero, it is necessary to work on the log scale,
where
$$
\log p(y \mid x, \alpha, \beta)
= \sum_{n = 1}^N \log
\textrm{bernoulli}(y_n \mid
\textrm{logit}^{-1}(\alpha + x_n \cdot \beta)
).
$$
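The following R sketch evaluates this log-scale likelihood on
simulated data; the data, the function name, and the parameter values
are all illustrative.
```{r}
# Log likelihood of logistic regression, computed on the log scale.
log_likelihood <- function(alpha, beta, x, y) {
  theta <- plogis(alpha + as.vector(x %*% beta))      # inverse logit of linear predictor
  sum(dbinom(y, size = 1, prob = theta, log = TRUE))  # sum of log Bernoulli terms
}

# Simulate N = 100 outcomes from a known alpha and beta.
set.seed(1234)
x <- matrix(rnorm(200), nrow = 100, ncol = 2)
y <- rbinom(100, size = 1, prob = plogis(0.5 + x %*% c(1, -2)))
log_likelihood(alpha = 0.5, beta = c(1, -2), x = x, y = y)
```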
For logistic regression (and other generalized linear
models), the inverse link function applied to the linear predictor
produces the expected value,
$$
\widehat{y}_n = \textrm{logit}^{-1}(\alpha + x_n \cdot \beta).
$$
The derivatives work out very neatly for logistic regression. The
derivative of the inverse logit function can be expressed simply
in terms of the function's value,
$$
\frac{\textrm{d}}{\textrm{d} u}
\textrm{logit}^{-1}(u)
= \textrm{logit}^{-1}(u) \cdot (1 - \textrm{logit}^{-1}(u)).
$$
Thus if $y = \textrm{logit}^{-1}(u),$ then $\frac{\textrm{d}}{\textrm{d}
u} y = y \cdot (1 - y).$
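This identity is easy to sanity check numerically, for example by
comparing it against a central finite-difference approximation at a
few points.
```{r}
u <- c(-2, 0, 1.5)
eps <- 1e-6
analytic  <- plogis(u) * (1 - plogis(u))                  # y * (1 - y)
numerical <- (plogis(u + eps) - plogis(u - eps)) / (2 * eps)
cbind(analytic, numerical)
```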
The next step is pushing this through the Bernoulli pmf, whose
derivative with respect to $\theta$ is
$$
\frac{\partial}{\partial \theta}
\textrm{bernoulli}(u \mid \theta)
=
\begin{cases}
1 & \mbox{if} \ u = 1
\\[2pt]
-1 & \mbox{if} \ u = 0.
\end{cases}
$$
The derivative of the logarithm can then be worked out as
$$
\frac{\partial}{\partial \theta}
\log \textrm{bernoulli}(u \mid \theta)
=
\begin{cases}
\frac{1}{\theta} & \mbox{if} \ u = 1
\\[2pt]
-\frac{1}{1 - \theta} & \mbox{if} \ u = 0.
\end{cases}
$$
Continuing to the full logistic regression log probability mass function,
\begin{eqnarray*}
\frac{\partial}{\partial \alpha} \log p(y \mid x, \alpha, \beta)
& = &
\sum_{n = 1}^N
\frac{\partial}{\partial \alpha}
\log \textrm{bernoulli}(y_n
\mid \textrm{logit}^{-1}(\alpha + x_n \cdot \beta))
\\[4pt]
& = &
\sum_{n = 1}^N
\left.
\frac{\partial}{\partial \theta}
\log \textrm{bernoulli}(y_n \mid \theta)
\right|_{\theta = \widehat{y}_n}
\cdot \widehat{y}_n \cdot (1 - \widehat{y}_n)
\\[4pt]
& = &
\sum_{n = 1}^N (y_n - \widehat{y}_n),
\end{eqnarray*}
where the last step follows because the summand is
$\frac{1}{\widehat{y}_n} \cdot \widehat{y}_n \cdot (1 - \widehat{y}_n)
= 1 - \widehat{y}_n$ when $y_n = 1$ and
$-\frac{1}{1 - \widehat{y}_n} \cdot \widehat{y}_n \cdot (1 - \widehat{y}_n)
= -\widehat{y}_n$ when $y_n = 0,$ both of which equal
$y_n - \widehat{y}_n.$
Derivatives with respect to the regression coefficients are just as simple,
$$
\frac{\partial}{\partial \beta} \log p(y \mid x, \alpha, \beta)
=
\sum_{n = 1}^N x_n \cdot (y_n - \widehat{y}_n),
$$
or coefficient-wise,
$$
\frac{\partial}{\partial \beta_k} \log p(y \mid x, \alpha, \beta)
=
\sum_{n = 1}^N x_{n, k} \cdot (y_n - \widehat{y}_n).
$$