# Reverse Mode
The reverse-mode automatic differentiation algorithm, also known as
the adjoint method, is a means for efficiently computing derivatives
of smooth functions $f : \mathbb{R}^N \rightarrow \mathbb{R}$ from
multiple inputs to a single output (i.e., gradients).
## Adjoints
Suppose $y \in \mathbb{R}$ is the final dependent output variable and
$v$ is a subexpression used in the calculation of $y$. The adjoint of
$v$ is the derivative of $y$ with respect to $v$, and is written with
a bar,
$$
\overline{v} = \frac{\partial}{\partial v} y.
$$
Suppose that $v = f(u)$. The chain rule tells us that
$$
\overline{u}
= \frac{\partial y}{\partial u}
= \frac{\partial y}{\partial v}
\cdot \frac{\partial v}{\partial u}
= \frac{\partial y}{\partial v}
\cdot \frac{\partial f(u)}{\partial u}
= \overline{v} \cdot f'(u).
$$
Because a subexpression may be used in more than one enclosing
expression, the adjoint rules are expressed as increments,
$$
\overline{u} \ \ {\small +}{=} \ \ \overline{v} \cdot f'(u).
$$
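More generally, if $u$ is used by several subexpressions $v_1, \ldots,
v_K$, each lying on a path to $y$, the multivariate chain rule sums
the contributions,
$$
\overline{u}
= \frac{\partial y}{\partial u}
= \sum_{k=1}^K \frac{\partial y}{\partial v_k}
  \cdot \frac{\partial v_k}{\partial u}
= \sum_{k=1}^K \overline{v_k} \cdot \frac{\partial v_k}{\partial u},
$$
and the increments accumulate exactly this sum, one term per use.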
For example, if $v = \exp(u)$, with $\exp'(u) = \exp(u)$, the rule is
$$
\overline{u} \ \ {\small +}{=} \ \ \overline{v} \cdot \exp(u).
$$
For logarithms, if $v = \log u$, then because $\log'(u) = \frac{1}{u}$,
the adjoint rule is
$$
\overline{u} \ \ {\small +}{=} \ \ \overline{v} \cdot \frac{1}{u}.
$$
When a function has more than one argument, adjoints propagate
independently to each operand, multiplying the result's adjoint by the
partial derivative with respect to that operand. For example, if $w = u \cdot v$, then because
$\frac{\partial}{\partial u} u \cdot v = v$ and
$\frac{\partial}{\partial v} u \cdot v = u$, the adjoint rules are
$$
\overline{u} \ \ {\small +}{=} \ \ \overline{w} \cdot v
$$
and
$$
\overline{v} \ \ {\small +}{=} \ \ \overline{w} \cdot u.
$$
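Spelled out imperatively, these rules are just multiply-and-accumulate
updates. A minimal Python sketch with illustrative variable names (not
taken from the text), treating $w = u \cdot v$ as the final output:

```python
# Forward pass: compute the value.
u, v = 1.2, 3.9
w = u * v

# Adjoints start at zero; the rules only ever increment them.
u_bar = 0.0
v_bar = 0.0
w_bar = 1.0   # w is taken to be the final output, so dw/dw = 1

# Reverse pass: the multiplication adjoint rules.
u_bar += w_bar * v   # d(u * v)/du = v
v_bar += w_bar * u   # d(u * v)/dv = u
```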
## Adjoint propagation and continuations
Adjoint rules increment the adjoints of operands based on the adjoint
of the result, and as such, must be executed after the adjoint of the
result has been computed. This leads to adjoints being propagated in
the reverse of the order in which values were computed, hence the name
of the method.^[Any topological sort of the expressions in which
results come after their operands will suffice.]
Computationally, as each operation executes, the code to handle
adjoint propagation is pushed onto a stack. After the final value for
which derivatives are required has been computed, the adjoint code is
executed by popping it off the stack and running it. The adjoint code
amounts to a continuation, that is, code stored together with some of
its context to be executed later. As usual, continuations can be
implemented as closures, with the closures stored on a stack and
executed in last-in/first-out order.
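To make this concrete, here is a minimal sketch of the closure-based
approach in Python; the names `Var`, `tape`, `mul`, and `grad` are
illustrative rather than taken from any particular library. Each
operation computes its value immediately and pushes a closure, the
continuation, that will later propagate the result's adjoint to its
operands.

```python
import math

class Var:
    """A node in the expression: a value plus a slot for its adjoint."""
    def __init__(self, value):
        self.value = value
        self.adjoint = 0.0   # adjoints are initialized to zero

# Stack of continuations; each closure propagates one result's adjoint
# to its operands.  Popping runs the reverse pass in LIFO order.
tape = []

def mul(u, v):
    w = Var(u.value * v.value)
    def backward():
        u.adjoint += w.adjoint * v.value   # u_bar += w_bar * v
        v.adjoint += w.adjoint * u.value   # v_bar += w_bar * u
    tape.append(backward)
    return w

def log(u):
    v = Var(math.log(u.value))
    def backward():
        u.adjoint += v.adjoint / u.value   # u_bar += v_bar * (1/u)
    tape.append(backward)
    return v

def grad(y):
    """Seed the final result's adjoint and unwind the tape."""
    y.adjoint = 1.0
    while tape:
        tape.pop()()   # execute continuations last-in/first-out
```

A global tape keeps the sketch short; a real implementation would
localize it, but the last-in/first-out discipline is the same.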
For example, consider the simple compound expression $y = \log (u \cdot
v).$ To compute the value of $y$, the product $u \cdot v$ is computed
first, then its logarithm. This can be expressed with intermediate
variables as a function of the independent variables $u$ and $v$ as
follows, where the numbers in parentheses indicate the order in which
the steps are performed.
$$
\begin{array}{ll||lr}
\textrm{execution}
& \textrm{forward}
& \textrm{reverse}
& \textrm{execution}
\\[-4pt]
\textrm{order} \downarrow
& \textrm{values}
& \textrm{adjoints}
& \textrm{order} \uparrow
\\ \hline
(1) & a = u \cdot v & \overline{u} \ \ {\small +}{=} \ \ \overline{a} \cdot v & (6)
\\
& & \overline{v} \ \ {\small +}{=} \ \ \overline{a} \cdot u & (5)
\\ \hline
(2) & b = \log a & \overline{a} \ \ {\small +}{=} \ \ \overline{b} \cdot \frac{1}{a}
& (4)
\\ \hline
& & \overline{b} = 1 & (3)
\end{array}
$$
First, the values are computed in a forward pass (steps 1 and 2).
Next, all adjoints are initialized to zero, and the adjoint of the
final result is set to one (step 3). Finally, the adjoints are
propagated in a reverse pass (steps 4, 5, and 6). These steps are
executed with concrete values. For example, taking $u = 1.2$ and $v =
3.9$, execution proceeds as follows, with values rounded to two
decimal places,
$$
\begin{array}{c|rll|c|r}
\textrm{step} & \textrm{variable} & \textrm{op} & \textrm{value} &
\textrm{symbolic} & \textrm{numeric} \\ \hline
& u & = & 1.2 & u & 1.20
\\
& v & = & 3.9 & v & 3.90
\\ \hline
(1) & a & = & u \cdot v & u \cdot v & 4.68
\\
(2) & b & = & \log a & \log (u \cdot v) & \approx 1.54
\\ \hline
(3) & \overline{b} & = & 1 & 1 & 1.00
\\ \hline
(4) & \overline{a} & {\small +}{=} & \overline{b} \cdot \frac{1}{a}
& \frac{1}{u \cdot v}
& \approx 0.21
\\
(5) & \overline{v} & {\small +}{=} & \overline{a} \cdot u
& \frac{1}{v} & \approx 0.26
\\
(6) & \overline{u} & {\small +}{=} & \overline{a} \cdot v
& \frac{1}{u} & \approx 0.83
\end{array}
$$
Before the algorithm begins, the independent variables $u$ and $v$ are
assumed to be set to their values. Steps (1) and (2) then calculate
the values of the subexpressions, first $a = u \cdot v$ and then the
final result $b = \log (u \cdot v)$. To start the reverse pass in
step (3), the final result $b$ has its adjoint $\overline{b}$ set to 1.
The reverse pass then works backward through the expressions,
incrementing the adjoint of each operand. The final adjoints
$\overline{u}$ and $\overline{v}$ are the components of the gradient of
$f(u, v) = \log (u \cdot v)$, evaluated at
$\begin{bmatrix} 1.2 & 3.9 \end{bmatrix}$,
$$
\nabla f(1.2, 3.9)
=
\begin{bmatrix}
\frac{\partial}{\partial u} \log (u \cdot v)
&
\frac{\partial}{\partial v} \log (u \cdot v)
\end{bmatrix}
\approx
\begin{bmatrix}
0.83 & 0.26
\end{bmatrix}.
$$
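Continuing the hypothetical `Var` sketch from above, the same
computation reproduces these values up to rounding:

```python
u = Var(1.2)
v = Var(3.9)
b = log(mul(u, v))   # forward pass: steps (1) and (2)
grad(b)              # reverse pass: steps (3) through (6)

print(round(b.value, 2))    # 1.54
print(round(u.adjoint, 2))  # 0.83
print(round(v.adjoint, 2))  # 0.26
```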
The results of automatic differentiation can be verified with
analytical derivatives,
$$
\overline{u}
= \frac{\partial}{\partial u} \log (u \cdot v)
= \frac{1}{u}
\approx 0.83
$$
and
$$
\overline{v}
= \frac{\partial}{\partial v} \log (u \cdot v)
= \frac{1}{v}
\approx 0.26.
$$