# Reverse Mode
The reverse-mode automatic differentiation algorithm, also known as
the adjoint method, is a means for efficiently computing derivatives
of smooth functions $f : \mathbb{R}^N \rightarrow \mathbb{R}$ from
multiple inputs to a single output (i.e., gradients).
## Adjoints
Suppose $y \in \mathbb{R}$ is the final dependent output variable and
$v$ is a subexpression used in the calculation of $y$. The adjoint of
$v$ is the derivative of $y$ with respect to $v$, and is written with
a bar,
$$
\overline{v} = \frac{\partial}{\partial v} y.
$$
Suppose that $v = f(u)$. The chain rule tells us that
$$
\overline{u}
= \frac{\partial y}{\partial u}
= \frac{\partial y}{\partial v}
\cdot \frac{\partial v}{\partial u}
= \frac{\partial y}{\partial v}
\cdot \frac{\partial f(u)}{\partial u}
= \overline{v} \cdot f'(u).
$$
Because a subexpression may be used in more than one enclosing
expression, the adjoint rules are expressed as increments,
$$
\overline{u} \ \ {\small +}{=} \ \ \overline{v} \cdot f'(u).
$$
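More generally, if $u$ is used by several subexpressions $v_1, \ldots,
v_K$, each lying on a path to $y$, the multivariate chain rule sums
the contributions,
$$
\overline{u}
= \frac{\partial y}{\partial u}
= \sum_{k=1}^K \frac{\partial y}{\partial v_k}
  \cdot \frac{\partial v_k}{\partial u}
= \sum_{k=1}^K \overline{v_k} \cdot \frac{\partial v_k}{\partial u},
$$
and the increments accumulate exactly this sum, one term per use.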
For example, if $v = \exp(u)$, with $\exp'(u) = \exp(u)$, the rule is
$$
\overline{u} \ \ {\small +}{=} \ \ \overline{v} \cdot \exp(u).
$$
For logarithms, if $v = \log u$, then because $\log'(u) = \frac{1}{u}$,
the adjoint rule is
$$
\overline{u} \ \ {\small +}{=} \ \ \overline{v} \cdot \frac{1}{u}.
$$
When a function has more than one argument, adjoints propagate
independently to each operand, multiplying the result's adjoint by the
partial derivative with respect to that operand. For example, if $w = u \cdot v$, then because
$\frac{\partial}{\partial u} u \cdot v = v$ and
$\frac{\partial}{\partial v} u \cdot v = u$, the adjoint rules are
$$
\overline{u} \ \ {\small +}{=} \ \ \overline{w} \cdot v
$$
and
$$
\overline{v} \ \ {\small +}{=} \ \ \overline{w} \cdot u.
$$
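Spelled out imperatively, these rules are just multiply-and-accumulate
updates. A minimal Python sketch with illustrative variable names (not
taken from the text), treating $w = u \cdot v$ as the final output:

```python
# Forward pass: compute the value.
u, v = 1.2, 3.9
w = u * v

# Adjoints start at zero; the rules only ever increment them.
u_bar = 0.0
v_bar = 0.0
w_bar = 1.0   # w is taken to be the final output, so dw/dw = 1

# Reverse pass: the multiplication adjoint rules.
u_bar += w_bar * v   # d(u * v)/du = v
v_bar += w_bar * u   # d(u * v)/dv = u
```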
## Adjoint propagation and continuations
Adjoint rules increment the adjoints of operands based on the adjoint
of the result, and as such, must be executed after the adjoint of the
result has been computed. This leads to adjoints being propagated in
the reverse of the order in which values were computed, hence the name
of the method.^[Any topological sort of the expressions in which
results come after their operands will suffice.]
Computationally, as each operation executes, the code to handle
adjoint propagation is pushed onto a stack. After the final value for
which derivatives are required has been computed, the adjoint code is
executed by popping it off the stack and running it. The adjoint code
amounts to a continuation, that is, code stored together with some of
its context to be executed later. As usual, continuations can be
implemented as closures, with the closures stored on a stack and
executed in last-in/first-out order.
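To make this concrete, here is a minimal sketch of the closure-based
approach in Python; the names `Var`, `tape`, `mul`, and `grad` are
illustrative rather than taken from any particular library. Each
operation computes its value immediately and pushes a closure, the
continuation, that will later propagate the result's adjoint to its
operands.

```python
import math

class Var:
    """A node in the expression: a value plus a slot for its adjoint."""
    def __init__(self, value):
        self.value = value
        self.adjoint = 0.0   # adjoints are initialized to zero

# Stack of continuations; each closure propagates one result's adjoint
# to its operands.  Popping runs the reverse pass in LIFO order.
tape = []

def mul(u, v):
    w = Var(u.value * v.value)
    def backward():
        u.adjoint += w.adjoint * v.value   # u_bar += w_bar * v
        v.adjoint += w.adjoint * u.value   # v_bar += w_bar * u
    tape.append(backward)
    return w

def log(u):
    v = Var(math.log(u.value))
    def backward():
        u.adjoint += v.adjoint / u.value   # u_bar += v_bar * (1/u)
    tape.append(backward)
    return v

def grad(y):
    """Seed the final result's adjoint and unwind the tape."""
    y.adjoint = 1.0
    while tape:
        tape.pop()()   # execute continuations last-in/first-out
```

A global tape keeps the sketch short; a real implementation would
localize it, but the last-in/first-out discipline is the same.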
For example, consider the simple compound expression $y = \log (u \cdot
v).$ To compute the value of $y$, the product $u \cdot v$ is computed
first, then its logarithm. This can be expressed with intermediate
variables as a function of the independent variables $u$ and $v$ as
follows, where the numbers in parentheses indicate the order in which
the steps are performed.
$$
\begin{array}{ll||lr}
\textrm{execution}
& \textrm{forward}
& \textrm{reverse}
& \textrm{execution}
\\[-4pt]
\textrm{order} \downarrow
& \textrm{values}
& \textrm{adjoints}
& \textrm{order} \uparrow
\\ \hline
(1) & a = u \cdot v & \overline{u} \ \ {\small +}{=} \ \ \overline{a} \cdot v & (6)
\\
& & \overline{v} \ \ {\small +}{=} \ \ \overline{a} \cdot u & (5)
\\ \hline
(2) & b = \log a & \overline{a} \ \ {\small +}{=} \ \ \overline{b} \cdot \frac{1}{a}
& (4)
\\ \hline
& & \overline{b} = 1 & (3)
\end{array}
$$
First, the values are computed in a forward pass (steps 1 and 2).
Next, all adjoints are initialized to zero, and the adjoint of the
final result is set to one (step 3). Finally, the adjoints are
propagated in a reverse pass (steps 4, 5, and 6). These steps are
executed with concrete values. For example, taking $u = 1.2$ and $v =
3.9$, execution proceeds as follows, with values rounded to two
decimal places,
$$
\begin{array}{c|rll|c|r}
\textrm{step} & \textrm{variable} & \textrm{op} & \textrm{value} &
\textrm{symbolic} & \textrm{numeric} \\ \hline
& u & = & 1.2 & u & 1.20
\\
& v & = & 3.9 & v & 3.90
\\ \hline
(1) & a & = & u \cdot v & u \cdot v & 4.68
\\
(2) & b & = & \log a & \log (u \cdot v) & \approx 1.54
\\ \hline
(3) & \overline{b} & = & 1 & 1 & 1.00
\\ \hline
(4) & \overline{a} & {\small +}{=} & \overline{b} \cdot \frac{1}{a}
& \frac{1}{u \cdot v}
& \approx 0.21
\\
(5) & \overline{v} & {\small +}{=} & \overline{a} \cdot u
& \frac{1}{v} & \approx 0.26
\\
(6) & \overline{u} & {\small +}{=} & \overline{a} \cdot v
& \frac{1}{u} & \approx 0.83
\end{array}
$$
Before the algorithm begins, the independent variables $u$ and $v$ are
assumed to be set to their values. Steps (1) and (2) then calculate
the values of the subexpressions, first $a = u \cdot v$ and then the
final result $b = \log (u \cdot v)$. To start the reverse pass in
step (3), the final result $b$ has its adjoint $\overline{b}$ set to 1.
The reverse pass then works backward through the expressions,
incrementing the adjoint of each operand. The final adjoints
$\overline{u}$ and $\overline{v}$ are the components of the gradient of
$f(u, v) = \log (u \cdot v)$, evaluated at
$\begin{bmatrix} 1.2 & 3.9 \end{bmatrix}$,
$$
\nabla f(1.2, 3.9)
=
\begin{bmatrix}
\frac{\partial}{\partial u} \log (u \cdot v)
&
\frac{\partial}{\partial v} \log (u \cdot v)
\end{bmatrix}
\approx
\begin{bmatrix}
0.83 & 0.26
\end{bmatrix}.
$$
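Continuing the hypothetical `Var` sketch from above, the same
computation reproduces these values up to rounding:

```python
u = Var(1.2)
v = Var(3.9)
b = log(mul(u, v))   # forward pass: steps (1) and (2)
grad(b)              # reverse pass: steps (3) through (6)

print(round(b.value, 2))    # 1.54
print(round(u.adjoint, 2))  # 0.83
print(round(v.adjoint, 2))  # 0.26
```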
The results of automatic differentiation can be verified with
analytical derivatives,
$$
\overline{u}
= \frac{\partial}{\partial u} \log (u \cdot v)
= \frac{1}{u}
\approx 0.83
$$
and
$$
\overline{v}
= \frac{\partial}{\partial v} \log (u \cdot v)
= \frac{1}{v}
\approx 0.26.
$$