forward.Rmd

# Forward Mode

The forward-mode automatic differentiation algorithm, also known as
the tangent method, is a means for efficiently computing derivatives
of smooth functions $f:\mathbb{R} \rightarrow \mathbb{R}^M$
with a single input and multiple outputs.

Suppose $x \in \mathbb{R}$ and that $v = f(u)$.  We write a
dot over an expression to indicate a derivative with respect to a
distinguished variable $x$,
$$
\dot{u} = \frac{\partial}{\partial x} u.
$$
The term $\dot{u}$ is called the tangent of $u$ with respect to $x$;
the $x$ is implicit in the notation, but assumed to be the same $x$ in
expressions with multiple dotted expressions.

For example, if $v = -u$, then by the chain rule,
$$
\dot{v}
= \frac{\partial}{\partial x} v
= \frac{\partial}{\partial x} -u
= - \frac{\partial}{\partial x} u
= -\dot{u}.
$$
Similarly, if we have $v = \exp(u)$, then
$$
\dot{v}
= \frac{\partial}{\partial x} v
= \frac{\partial}{\partial x} \exp(u)
= \exp(u) \cdot \frac{\partial}{\partial x} u
= \exp(u) \cdot \dot{u}.
$$

Derivative propagatin in forward mode works the same way for
multivariate functions.  For example, if $y = u \cdot v$ is a product,
the usual derivative rule for products applies,
$$
\dot{y}
= \frac{\partial}{\partial x} u \cdot v
= \left( \frac{\partial}{\partial x} u \right) \cdot v
+ u \cdot \left( \frac{\partial}{\partial x} v \right)
= \dot{u} \cdot v + u \cdot \dot{v}.
$$

## Dual numbers

Forward-mode automatic differentiation can be formalized using dual
numbers consisting of the value of an expression and its derivative,
$\langle u, \dot{u} \rangle.$  Smooth functions may then be extended
to operate on dual numbers.  For example, negation is defined for dual
numbers by
$$
-\langle u, \dot{u}\rangle = \langle -u, -\dot{u} \rangle.
$$
For sums,
$$
\langle u, \dot{u} \rangle + \langle v, \dot{v} \rangle
= \langle u + v, \dot{u} + \dot{v} \rangle,
$$
and for differences
$$
\langle u, \dot{u} \rangle - \langle v, \dot{v} \rangle
= \langle u - v, \dot{u} - \dot{v} \rangle,
$$
For products,
$$
\langle u, \dot{u} \rangle \cdot \langle v, \dot{v} \rangle
= \langle u \cdot v, \dot{u} \cdot v + u \cdot \dot{v} \rangle
$$
and for quotients,
$$
\langle u, \dot{u} \rangle / \langle v, \dot{v} \rangle
= \langle u / v, \dot{u} / v - u / v^2 \cdot \dot{v} \rangle.
$$
For exponentiation,
$$
\exp\left( \langle u, \dot{u} \rangle \right)
= \langle \exp(u), \exp(u) \cdot \dot{u}\rangle,
$$
and for the logarithm,
$$
\log \langle u, \dot{u} \rangle
=
\langle \log u, \frac{1}{u} \cdot \dot{u} \rangle.
$$
These all translate directly into rules of tangent propagation.


## Gradient-vector products

The product of the gradient of a function at a given point and an
arbitrary vector can be computed efficiently using forward-mode
automatic differentiation.  Suppose $f : \mathbb{R}^N \rightarrow
\mathbb{R}$ is a smooth function and $x \in \mathbb{R}^N$ is a point
in its domain.  To compute the derivative of $f$ at $x$ along an arbitrary
vector $v \in \mathbb{R}^N$, it suffices to compute the
gradient-vector product $\nabla_v f(x) = \nabla f(x) \cdot v$.  This
can be done with forward-mode automatic differentiation by
initializing the tangents of the input variable $x$ with the vector
being multiplied,
$$
\dot{x} = v.
$$
Then dual arithmetic is used as usual to compute the result, and the
final dual number's tangent is the gradient-vector product.

For example, suppose
$$
f(x) = x_1 \cdot x_2 + x_2,
$$
$$
x = \begin{bmatrix} 12.9 & 127.1 \end{bmatrix}^{\top},
$$
and
$$
v = \begin{bmatrix} 0.3 & -1.2 \end{bmatrix}^{\top}.
$$
The gradient-vector product $\nabla_v f(x) = \nabla f(x) \cdot v$
is derived as
$$
\begin{array}{rcl}
\langle 12.9, 0.3 \rangle \cdot \langle 127.1, -1.2 \rangle
+ \langle 127.1, -1.2 \rangle
& = &
\langle 12.9 \cdot 127.1, \, 0.3 \cdot 127.1 + 12.9 \cdot -1.2 \rangle
+ \langle 127.1, -1.2 \rangle
\\[4pt]
& = &
\langle 12.9 \cdot 127.1 + 127.1, \
        0.3 \cdot 127.1 + 12.9 \cdot -1.2 + -1.2
\rangle
\\[4pt]
& = &
\langle 1767, 21 \rangle.
\end{array}
$$
Checking the gradient-vector product analytically,
$$
\nabla f(x)
=
\begin{bmatrix} x_2 & x_1 + 1 \end{bmatrix}
=
\begin{bmatrix} 127.1 & 12.9 + 1 \end{bmatrix},
$$
so that
$$
\nabla f(x) \cdot v
= \begin{bmatrix} 127.1 & 13.9 \end{bmatrix}
  \cdot  \begin{bmatrix} 0.3 & -1.2 \end{bmatrix}^{\top}
  = 21.
$$


## Directional derivatives

A directional derivative measures the change in a multivariate
function in a given direction.  A direction can be specified as a
point on a sphere, which corresponds to a unit vector $v$, i.e., a
vector of unit length, where $\sum_{n=1}^N v_n^2 = 1.$ Given a smooth
function $f : \mathbb{R}^N \rightarrow \mathbb{R},$ its derivative in
the direction of unit vector $v \in \mathbb{R}^N$ at a point $x \in
\mathbb{R}^N$ is $\nabla f(x) \cdot v.$ That is, a directional
derivative is a gradient-vector product where the vector is a unit
vector.

Partial derivatives can be defined as vector-gradient products,
$$
\frac{\partial}{\partial x_n} f(x)
=
\nabla f(x) \cdot u_n,
$$
by taking $u_n$ to be the unit vector pointing along the $n$-th axis,
$$
u_n =
\Big[
\begin{array}[t]{c}
\
\underbrace{0  \cdots 0 }_{n - 1 \ \textrm{zeros}}
\ \ 1 \ \
\underbrace{0  \cdots 0}_{N - n \ \textrm{zeros}}
\
\end{array}
\Big].
$$