After working with PyTorch quite a bit, I decided that I had to implement a basic MLP from scratch at least once. The theoretical foundations were not new to me, but I find the usual derivation of backpropagation via sum formulae unreadable. You'll find a rough overview and the derivation in matrix notation below.
A YAML config file configures the basic network and training details such as layer sizes, activation functions, the loss function and more. The main.py script is the main entry point and contains three configurable components:
- the path to the config file,
- a numpy definition of the target function to approximate, and
- the setup of the data sampler, controlled by a `std` term that determines the amount of noise applied to training samples and the x range to draw from.
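For illustration, the target function and the noisy sampler set up in main.py might look like the following. This is a hedged sketch: the names `target_fn` and `sample` are hypothetical and not taken from the repository.

```python
import numpy as np

def target_fn(x):
    # Hypothetical target function to approximate, e.g. a sine wave.
    return np.sin(x)

def sample(n, std=0.1, x_range=(-np.pi, np.pi)):
    # Draw x uniformly from the configured range and add Gaussian
    # noise with standard deviation `std` to the targets.
    x = np.random.uniform(*x_range, size=n)
    y = target_fn(x) + np.random.normal(0.0, std, size=n)
    return x, y

x, y = sample(100, std=0.05)
```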
Basic checkpointing is supported and also configured in the config file. Run the training/validation cycle by simply executing main.py; the loss curve is drawn when training finishes.
The file source/network.py contains the entire network definition.
- Add batch support
- Add support for classification problems
The MLP consists of a series of fully connected layers.

The basic equation for each node $j$ in layer $l$ is

$a_j^{l} = \sigma^l\left(\sum_k w_{jk}^{l}\, a_k^{l-1} + b_j^{l}\right)$

for an activation function $\sigma^l$, weights $w_{jk}^{l}$ and a bias $b_j^{l}$.

Introducing matrix and Einstein notation, the equation for the entire layer $l$ becomes

$\mathbf{a}^{l} = \sigma^l\left(\mathbf{W}^{l} \cdot \mathbf{a}^{l-1} + \mathbf{b}^{l}\right)$
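The layer equation can be sketched in a few lines of numpy. This is a minimal illustration assuming tanh as the activation, not the repository's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer l mapping 3 inputs to 2 outputs.
W = rng.standard_normal((2, 3))   # W^l
b = rng.standard_normal(2)        # b^l
a_prev = rng.standard_normal(3)   # a^{l-1}, output of the previous layer

z = W @ a_prev + b                # z^l = W^l . a^{l-1} + b^l
a = np.tanh(z)                    # a^l = sigma(z^l)
```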
The key takeaways are:
- that the formula is generic for all layers (input, hidden and output), and
- that the layer output is a function of three parameters: $\mathbf{a}^{l} = \mathbf{a}^{l}(\mathbf{a}^{l-1}, \mathbf{W}^{l}, \mathbf{b}^{l})$.
Concerning the notation, the input to the first layer is generally denoted by $\mathbf{a}^{0} := \mathbf{x}$ and the output of the final layer $L$ by $\mathbf{a}^{L} = \mathbf{y}$.
With the formula for a single layer, the network can now be described by recursive application of the layers, nested from the output (outermost) to the input (innermost):

$\mathbf{y} = \mathbf{a}^{L}(\mathbf{a}^{L-1}(\ldots\,\mathbf{a}^{1}(\mathbf{x})))$
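Sketched in numpy, the forward pass is just this composition applied layer by layer. The architecture and tanh activations below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical architecture: 1 -> 4 -> 4 -> 1.
sizes = [1, 4, 4, 1]
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    # y = a^L(a^{L-1}(... a^1(x))): feed each layer's output
    # into the next one.
    a = x
    for W, b in params:
        a = np.tanh(W @ a + b)
    return a

y = forward(np.array([0.5]))
```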
Weights and biases of a neural network are updated by some variant of the gradient descent algorithm. The efficient calculation of gradients via the backpropagation algorithm made larger networks feasible (alongside capable hardware, of course) and is in essence just clever use of the chain rule.
In order to train the network by gradient descent, i.e. find a solution with the lowest cost, a loss function $\hat{L}(\bar{\mathbf{y}}, \mathbf{y})$ is required that measures the deviation of the prediction $\mathbf{y}$ from the target $\bar{\mathbf{y}}$.
Introducing now the following abbreviations:

- $\hat{L} := \hat{L}(\bar{\mathbf{y}}, \mathbf{y})$,
- $\mathbf{z}^l(\mathbf{a}^{l-1}, \mathbf{W}^{l}, \mathbf{b}^{l}) := \mathbf{W}^l \cdot \mathbf{a}^{l-1} + \mathbf{b}^l$,
- $\mathbf{a}^L(\mathbf{z}^L) := \mathbf{y}$,
- $\frac{d\mathbf{z}^l(\mathbf{a}^{l-1}, \mathbf{W}^{l}, \mathbf{b}^{l})}{d\mathbf{W}^{l}} = \mathbf{a}^{l-1}$ for every layer $l$, and
- $\delta^l := \frac{d\mathbf{a}^l}{d\mathbf{z}^l}$.
Provided implementations for the analytic derivatives of functions like the loss and the activations, the gradients can be derived with the chain rule.

The gradient w.r.t. the weights of neurons in the last layer $L$ is

$\frac{d\hat{L}}{d\mathbf{W}^{L}} = \frac{d\hat{L}}{d\mathbf{a}^{L}} \frac{d\mathbf{a}^{L}}{d\mathbf{z}^{L}} \frac{d\mathbf{z}^{L}}{d\mathbf{W}^{L}} = \frac{d\hat{L}}{d\mathbf{a}^{L}}\, \delta^{L}\, \mathbf{a}^{L-1}$

Equivalently, calculating the derivative w.r.t. the weights of the second to last layer $L-1$ yields

$\frac{d\hat{L}}{d\mathbf{W}^{L-1}} = \frac{d\hat{L}}{d\mathbf{a}^{L}} (\delta^{L} \cdot \mathbf{W}^{L})\, \delta^{L-1}\, \mathbf{a}^{L-2}$

This scheme can be applied repeatedly until arriving at the first layer. The gradient w.r.t. the weights of an arbitrary layer $l$ reads

$\frac{d\hat{L}}{d\mathbf{W}^{l}} = \frac{d\hat{L}}{d\mathbf{a}^{L}} (\delta^{L} \cdot \mathbf{W}^{L}) \ldots (\delta^{l+1} \cdot \mathbf{W}^{l+1})\, \delta^{l}\, \mathbf{a}^{l-1}$

Derivatives w.r.t. the biases follow the same pattern; since $\frac{d\mathbf{z}^l}{d\mathbf{b}^{l}} = \mathbf{1}$, the trailing factor $\mathbf{a}^{l-1}$ is simply replaced by $1$.
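The last-layer formula can be checked numerically, here with a squared loss and a tanh activation; all names and the choice of loss are assumptions for this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)
a_prev = rng.standard_normal(3)
y_bar = rng.standard_normal(2)        # target

def loss(W):
    y = np.tanh(W @ a_prev + b)
    return 0.5 * np.sum((y - y_bar) ** 2)

# Analytic gradient: dL/dW^L = (dL/da^L * delta^L) outer a^{L-1}.
z = W @ a_prev + b
y = np.tanh(z)
dL_da = y - y_bar                     # dL/da^L for the squared loss
delta = 1.0 - np.tanh(z) ** 2         # da^L/dz^L for tanh
grad = np.outer(dL_da * delta, a_prev)

# Numerical check via central differences.
num = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
```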
With these abbreviations, the obtained equations have a structure that exposes the core principle of backpropagation: the ability to cache and reuse calculations per layer. The backpropagation algorithm is applied by executing the backward pass on the network layers in reverse order:
- Layer $L$ receives the analytic derivative of the loss function w.r.t. the output, $\frac{d\hat{L}}{d\mathbf{a}^L}$, and returns $\frac{d\hat{L}}{d\mathbf{a}^L} (\delta^L \cdot \mathbf{W}^{L})$.
- Layer $L-1$ receives $\frac{d\hat{L}}{d\mathbf{a}^L} (\delta^L \cdot \mathbf{W}^{L})$ and returns $\frac{d\hat{L}}{d\mathbf{a}^L} (\delta^L \cdot \mathbf{W}^{L}) (\delta^{L-1} \cdot \mathbf{W}^{L-1})$.
- ...
- Layer 0 receives $\frac{d\hat{L}}{d\mathbf{a}^L} (\delta^L \cdot \mathbf{W}^{L}) \ldots (\delta^1 \cdot \mathbf{W}^1)$.
Within each layer, the gradients for weights and biases are calculated and cached. The optimizer (gradient descent algorithm) is finally applied by iterating over all layers and updating weights and biases.
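Putting it all together, a minimal training step with cached gradients and a plain gradient descent update might look like this. It is a sketch under the assumptions made throughout (tanh activations, squared loss); the names are not taken from source/network.py:

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [1, 4, 1]                     # hypothetical architecture
params = [[rng.standard_normal((m, n)) * 0.5, np.zeros(m)]
          for n, m in zip(sizes[:-1], sizes[1:])]

def train_step(x, y_bar, lr=0.1):
    # Forward pass, caching a^{l-1} and z^l per layer.
    a, cache = x, []
    for W, b in params:
        z = W @ a + b
        cache.append((a, z))
        a = np.tanh(z)
    # Backward pass: the upstream term starts as dL/da^L.
    upstream = a - y_bar              # squared-loss derivative
    grads = []
    for (W, b), (a_prev, z) in zip(reversed(params), reversed(cache)):
        delta = upstream * (1.0 - np.tanh(z) ** 2)      # apply delta^l
        grads.append((np.outer(delta, a_prev), delta))  # dL/dW^l, dL/db^l
        upstream = W.T @ delta        # pass (delta^l . W^l) to layer l-1
    # Optimizer: iterate over all layers, update weights and biases.
    for (W, b), (dW, db) in zip(reversed(params), grads):
        W -= lr * dW
        b -= lr * db
    return 0.5 * np.sum((a - y_bar) ** 2)

losses = [train_step(np.array([0.3]), np.array([0.8])) for _ in range(50)]
```

Note that the gradients are collected during the backward pass and only applied afterwards, mirroring the cache-then-update structure described above.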