This document provides an explanation of the mathematical principles underlying the machine learning models implemented in this project, focusing primarily on the Multi-Layer Perceptron (MLP). It includes detailed mathematical derivations and proofs to offer a deeper understanding of the algorithms.
- Mathematical Foundations
- Pseudo-Code for Multi-Layer Perceptron (MLP) Training
- MLP Training Algorithm
- Detailed Steps
- Mathematical Symbols and Notations
- Multi-Layer Perceptron (MLP)
- Convolutional Neural Networks (CNNs)
- GPU Acceleration vs. CPU Parallelization
This pseudo-code outlines the training process of a Multi-Layer Perceptron (MLP) model using mathematical notation. It captures the essence of forward propagation, loss computation, backpropagation, and parameter updates.
$$
\begin{align*}
&\textbf{Inputs:} \\
&\quad \text{Training dataset: } \{ (\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) \}_{i=1}^{N} \\
&\quad \text{Number of epochs: } T \\
&\quad \text{Learning rate: } \eta \\
&\quad \text{Network architecture:} \\
&\quad \quad \text{Input size: } n \\
&\quad \quad \text{Hidden layer size: } m \\
&\quad \quad \text{Output size: } k \\
&\textbf{Initialize Parameters:} \\
&\quad \mathbf{W}^{(1)} \in \mathbb{R}^{m \times n} \sim \mathcal{N}(0, \sigma^2) \\
&\quad \mathbf{b}^{(1)} \in \mathbb{R}^{m} \leftarrow \mathbf{0} \\
&\quad \mathbf{W}^{(2)} \in \mathbb{R}^{k \times m} \sim \mathcal{N}(0, \sigma^2) \\
&\quad \mathbf{b}^{(2)} \in \mathbb{R}^{k} \leftarrow \mathbf{0} \\
&\textbf{Training Loop:} \\
&\quad \text{FOR } \text{epoch} = 1 \text{ TO } T \text{ DO} \\
&\quad \quad \text{FOR each training example } (\mathbf{x}, \mathbf{y}) \text{ DO} \\
&\quad \quad \quad \textbf{Forward Propagation:} \\
&\quad \quad \quad \quad \mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \\
&\quad \quad \quad \quad \mathbf{h} = f(\mathbf{z}^{(1)}) \\
&\quad \quad \quad \quad \mathbf{z}^{(2)} = \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)} \\
&\quad \quad \quad \quad \hat{\mathbf{y}} = g(\mathbf{z}^{(2)}) \\
&\quad \quad \quad \quad L = \text{Loss}(\mathbf{y}, \hat{\mathbf{y}}) \\
&\quad \quad \quad \textbf{Backpropagation:} \\
&\quad \quad \quad \quad \delta^{(2)} = \nabla_{\hat{\mathbf{y}}} L \odot g'(\mathbf{z}^{(2)}) \\
&\quad \quad \quad \quad \delta^{(1)} = (\mathbf{W}^{(2)\top} \delta^{(2)}) \odot f'(\mathbf{z}^{(1)}) \\
&\quad \quad \quad \textbf{Gradient Computation:} \\
&\quad \quad \quad \quad \nabla_{\mathbf{W}^{(2)}} L = \delta^{(2)} \mathbf{h}^\top \\
&\quad \quad \quad \quad \nabla_{\mathbf{b}^{(2)}} L = \delta^{(2)} \\
&\quad \quad \quad \quad \nabla_{\mathbf{W}^{(1)}} L = \delta^{(1)} \mathbf{x}^\top \\
&\quad \quad \quad \quad \nabla_{\mathbf{b}^{(1)}} L = \delta^{(1)} \\
&\quad \quad \quad \textbf{Parameter Update:} \\
&\quad \quad \quad \quad \mathbf{W}^{(2)} \leftarrow \mathbf{W}^{(2)} - \eta \nabla_{\mathbf{W}^{(2)}} L \\
&\quad \quad \quad \quad \mathbf{b}^{(2)} \leftarrow \mathbf{b}^{(2)} - \eta \nabla_{\mathbf{b}^{(2)}} L \\
&\quad \quad \quad \quad \mathbf{W}^{(1)} \leftarrow \mathbf{W}^{(1)} - \eta \nabla_{\mathbf{W}^{(1)}} L \\
&\quad \quad \quad \quad \mathbf{b}^{(1)} \leftarrow \mathbf{b}^{(1)} - \eta \nabla_{\mathbf{b}^{(1)}} L \\
&\quad \quad \text{END FOR} \\
&\quad \text{END FOR} \\
&\textbf{Output:} \\
&\quad \text{Trained parameters } \mathbf{W}^{(1)}, \mathbf{b}^{(1)}, \mathbf{W}^{(2)}, \mathbf{b}^{(2)}
\end{align*}
$$
- $\mathbf{x} \in \mathbb{R}^{n}$: Input vector.
- $\mathbf{y} \in \mathbb{R}^{k}$: True label (one-hot encoded).
- $\hat{\mathbf{y}} \in \mathbb{R}^{k}$: Predicted output vector.
- $f$, $g$: Activation functions (e.g., ReLU, Softmax).
- $L$: Loss function (e.g., Cross-Entropy Loss).
- $\eta$: Learning rate.
- $\odot$: Element-wise multiplication.
- $\mathcal{N}(0, \sigma^2)$: Normal distribution with mean $0$ and variance $\sigma^2$.
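To make the pseudo-code concrete, the following NumPy sketch implements the same loop. It assumes a ReLU hidden activation, a softmax output, and cross-entropy loss (so the output error reduces to $\hat{\mathbf{y}} - \mathbf{y}$); function and variable names are illustrative rather than taken from the project's code.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / e.sum()

def train_mlp(X, Y, n, m, k, epochs=10, eta=0.01, sigma=0.01, seed=0):
    """X: (N, n) inputs, Y: (N, k) one-hot targets. Returns trained parameters."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, sigma, size=(m, n)); b1 = np.zeros(m)
    W2 = rng.normal(0.0, sigma, size=(k, m)); b2 = np.zeros(k)

    for epoch in range(epochs):
        for x, y in zip(X, Y):
            # Forward propagation
            z1 = W1 @ x + b1
            h = relu(z1)
            z2 = W2 @ h + b2
            y_hat = softmax(z2)

            # Backpropagation (softmax + cross-entropy => delta2 = y_hat - y)
            delta2 = y_hat - y
            delta1 = (W2.T @ delta2) * (z1 > 0)    # ReLU derivative

            # Gradient computation (outer products for the weight matrices)
            gW2 = np.outer(delta2, h); gb2 = delta2
            gW1 = np.outer(delta1, x); gb1 = delta1

            # Parameter update (stochastic gradient descent)
            W2 -= eta * gW2; b2 -= eta * gb2
            W1 -= eta * gW1; b1 -= eta * gb1

    return W1, b1, W2, b2
```

Swapping in a different activation or loss only changes `relu`/`softmax` and the line computing `delta2`.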
An MLP is a type of feedforward artificial neural network that consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. MLPs are capable of approximating complex nonlinear functions and are widely used for classification and regression tasks.
An MLP typically consists of:
- Input Layer: Receives the input features.
- Hidden Layers: One or more layers where computations are performed.
- Output Layer: Produces the final output (e.g., class probabilities).
Mathematically, an MLP with one hidden layer can be represented as:

$$\mathbf{h} = f\left(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}\right)$$
$$\mathbf{\hat{y}} = g\left(\mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}\right)$$

Where:
- $\mathbf{x} \in \mathbb{R}^{n}$: Input vector.
- $\mathbf{W}^{(1)} \in \mathbb{R}^{m \times n}$: Weight matrix for the input-to-hidden layer.
- $\mathbf{b}^{(1)} \in \mathbb{R}^{m}$: Bias vector for the hidden layer.
- $f$: Activation function for the hidden layer.
- $\mathbf{h} \in \mathbb{R}^{m}$: Hidden layer activations.
- $\mathbf{W}^{(2)} \in \mathbb{R}^{k \times m}$: Weight matrix for the hidden-to-output layer.
- $\mathbf{b}^{(2)} \in \mathbb{R}^{k}$: Bias vector for the output layer.
- $g$: Activation function for the output layer.
- $\mathbf{\hat{y}} \in \mathbb{R}^{k}$: Predicted output vector.
Forward propagation involves computing the output of the network given an input. It is a sequence of matrix multiplications and function applications.
- Hidden Layer Computation:

  $$\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$$
  $$\mathbf{h} = f\left(\mathbf{z}^{(1)}\right)$$

- Output Layer Computation:

  $$\mathbf{z}^{(2)} = \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$$
  $$\mathbf{\hat{y}} = g\left(\mathbf{z}^{(2)}\right)$$
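As a small illustration, the two computations above can be written directly in NumPy; the choice of ReLU for $f$ and softmax for $g$ is an assumption of this example.

```python
import numpy as np

def forward(x, W1, b1, W2, b2, f=lambda z: np.maximum(0.0, z)):
    """Single forward pass through a one-hidden-layer MLP."""
    z1 = W1 @ x + b1            # hidden pre-activation
    h = f(z1)                   # hidden activations
    z2 = W2 @ h + b2            # output pre-activation
    e = np.exp(z2 - z2.max())   # softmax output layer (numerically stable)
    y_hat = e / e.sum()
    return h, y_hat
```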
Activation functions introduce non-linearity into the network, allowing it to learn complex patterns.
- ReLU (Rectified Linear Unit):

  $$f(z) = \max(0, z)$$

- Sigmoid Function:

  $$g(z) = \frac{1}{1 + e^{-z}}$$

- Softmax Function (for multi-class classification):

  $$g_i(\mathbf{z}) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
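These three activations are straightforward to implement; the max-subtraction in the softmax below is a standard numerical-stability trick rather than part of the mathematical definition:

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Softmax over the last axis; subtracting the max avoids overflow."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```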
The loss function quantifies the difference between the predicted output and the true output.
- Mean Squared Error (MSE) (for regression):

  $$L = \frac{1}{k} \sum_{i=1}^{k} \left(y_i - \hat{y}_i\right)^2$$

- Cross-Entropy Loss (for classification):

  $$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$
Where:

- $\mathbf{y}$: True labels (one-hot encoded).
- $\mathbf{\hat{y}}$: Predicted probabilities.
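A minimal NumPy sketch of both losses, assuming `y` and `y_hat` are one-dimensional arrays of length $k$ and that `y_hat` contains probabilities:

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean squared error over the k output components."""
    return np.mean((y - y_hat) ** 2)

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """Cross-entropy for a one-hot y; eps guards against log(0)."""
    return -np.sum(y * np.log(y_hat + eps))
```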
Backpropagation is an algorithm used to compute the gradient of the loss function with respect to the weights of the network. It applies the chain rule of calculus to compute these gradients efficiently.
- Compute Output Error:

  $$\delta^{(2)} = \nabla_{\hat{\mathbf{y}}} L \odot g'\left(\mathbf{z}^{(2)}\right)$$

  Where:

  - $\delta^{(2)} \in \mathbb{R}^{k}$: Error at the output layer, with $\mathbf{z}^{(2)} = \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$.
  - $g'$: Derivative of the activation function $g$.
  - $\odot$: Element-wise multiplication.
- Compute Hidden Layer Error:

  $$\delta^{(1)} = \left(\mathbf{W}^{(2)\top} \delta^{(2)}\right) \odot f'\left(\mathbf{z}^{(1)}\right)$$

  Where:

  - $\delta^{(1)} \in \mathbb{R}^{m}$: Error at the hidden layer, with $\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$.
  - $f'$: Derivative of the activation function $f$.
- Compute Gradients:

  $$\nabla_{\mathbf{W}^{(2)}} L = \delta^{(2)} \mathbf{h}^\top, \qquad \nabla_{\mathbf{b}^{(2)}} L = \delta^{(2)}$$
  $$\nabla_{\mathbf{W}^{(1)}} L = \delta^{(1)} \mathbf{x}^\top, \qquad \nabla_{\mathbf{b}^{(1)}} L = \delta^{(1)}$$
Weights are updated using gradient descent:

$$\theta \leftarrow \theta - \eta \nabla_{\theta} L$$

Where:

- $\theta$: Model parameters (weights and biases).
- $\eta$: Learning rate.
- $\nabla_{\theta} L$: Gradient of the loss function with respect to $\theta$.
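The update rule can be applied uniformly to all parameters; the sketch below assumes the parameters and gradients are stored in dictionaries with illustrative keys:

```python
def sgd_step(params, grads, eta):
    """In-place gradient-descent update: theta <- theta - eta * grad."""
    for name in params:
        params[name] -= eta * grads[name]

# Example usage with the MLP parameters from the pseudo-code:
# sgd_step({"W1": W1, "b1": b1, "W2": W2, "b2": b2},
#          {"W1": gW1, "b1": gb1, "W2": gW2, "b2": gb2}, eta=0.01)
```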
We will provide detailed mathematical derivations for the gradients with respect to the weights and biases in both layers.
Derivative of the Loss with Respect to Output Layer Weights $\mathbf{W}^{(2)}$

Objective: Compute $\frac{\partial L}{\partial \mathbf{W}^{(2)}}$.
Proof:
- Loss Function: For a single training example, the loss function using cross-entropy loss is:

  $$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$

- Predicted Output: The predicted output is:

  $$\mathbf{\hat{y}} = g\left(\mathbf{z}^{(2)}\right) = \text{softmax}\left(\mathbf{z}^{(2)}\right), \quad \text{where } \mathbf{z}^{(2)} = \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$$
- Compute $\frac{\partial L}{\partial z_i^{(2)}}$:

  The derivative of the loss with respect to $z_i^{(2)}$ is:

  $$\frac{\partial L}{\partial z_i^{(2)}} = \hat{y}_i - y_i$$

  Proof:

  - Using the chain rule:

    $$\frac{\partial L}{\partial z_i^{(2)}} = \sum_{j=1}^{k} \frac{\partial L}{\partial \hat{y}_j} \frac{\partial \hat{y}_j}{\partial z_i^{(2)}}, \qquad \frac{\partial L}{\partial \hat{y}_j} = -\frac{y_j}{\hat{y}_j}, \qquad \frac{\partial \hat{y}_j}{\partial z_i^{(2)}} = \begin{cases} \hat{y}_i\left(1 - \hat{y}_i\right) & j = i \\ -\hat{y}_j \hat{y}_i & j \neq i \end{cases}$$

  - For cross-entropy loss and softmax activation, this simplifies to:

    $$\frac{\partial L}{\partial z_i^{(2)}} = -y_i\left(1 - \hat{y}_i\right) + \sum_{j \neq i} y_j \hat{y}_i = \hat{y}_i \sum_{j} y_j - y_i = \hat{y}_i - y_i$$

    since $\sum_j y_j = 1$ for a one-hot label.
- Compute $\frac{\partial L}{\partial \mathbf{W}^{(2)}}$:

  The gradient with respect to the weights is:

  $$\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \delta^{(2)} \mathbf{h}^\top$$

  Where $\delta^{(2)} = \mathbf{\hat{y}} - \mathbf{y}$, since $\mathbf{z}^{(2)} = \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$ implies $\frac{\partial z_i^{(2)}}{\partial W_{ij}^{(2)}} = h_j$.
Derivative of the Loss with Respect to Hidden Layer Weights $\mathbf{W}^{(1)}$
Objective: Compute $\frac{\partial L}{\partial \mathbf{W}^{(1)}}$.
Proof:
- Compute $\frac{\partial L}{\partial \mathbf{h}}$:

  From the chain rule:

  $$\frac{\partial L}{\partial \mathbf{h}} = \mathbf{W}^{(2)\top} \delta^{(2)}$$
- Compute $\delta^{(1)}$:

  Applying the element-wise multiplication with the derivative of the activation function:

  $$\delta^{(1)} = \left(\mathbf{W}^{(2)\top} \delta^{(2)}\right) \odot f'\left(\mathbf{z}^{(1)}\right)$$
- Compute $\frac{\partial L}{\partial \mathbf{W}^{(1)}}$:

  The gradient with respect to the weights is:

  $$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \delta^{(1)} \mathbf{x}^\top$$
By systematically applying the chain rule, we compute the gradients of the loss with respect to each parameter in the network. This allows us to update the weights and biases during training to minimize the loss function.
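A useful way to validate such derivations is a numerical gradient check: compare the analytic gradient against centered finite differences. The sketch below does this for $\nabla_{\mathbf{W}^{(2)}} L$ under the softmax/cross-entropy assumptions used above; all sizes and names are illustrative.

```python
import numpy as np

def loss(W2, h, b2, y, eps=1e-12):
    """Cross-entropy of softmax(W2 @ h + b2) against a one-hot y."""
    z2 = W2 @ h + b2
    e = np.exp(z2 - z2.max())
    y_hat = e / e.sum()
    return -np.sum(y * np.log(y_hat + eps))

rng = np.random.default_rng(0)
m, k = 4, 3
h = rng.normal(size=m)                      # hidden activations
W2 = rng.normal(scale=0.1, size=(k, m))
b2 = np.zeros(k)
y = np.eye(k)[1]                            # one-hot target

# Analytic gradient: delta2 = y_hat - y, gradient = outer(delta2, h)
z2 = W2 @ h + b2
y_hat = np.exp(z2 - z2.max()); y_hat /= y_hat.sum()
grad_analytic = np.outer(y_hat - y, h)

# Numerical gradient via centered finite differences
grad_numeric = np.zeros_like(W2)
step = 1e-6
for i in range(k):
    for j in range(m):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[i, j] += step
        Wm[i, j] -= step
        grad_numeric[i, j] = (loss(Wp, h, b2, y) - loss(Wm, h, b2, y)) / (2 * step)

print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be very small (~1e-9)
```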
While not the primary focus of this project, CNNs are another class of neural networks particularly effective for image and spatial data processing.
The convolution operation applies a kernel (filter) over the input data to extract features.
Mathematically, for a 2D convolution (written here in the cross-correlation form used by most deep learning libraries):

$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n) \, K(m, n)$$
- $I$: Input image.
- $K$: Kernel (filter).
- $S$: Output feature map.
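For illustration, a direct (unoptimized) NumPy implementation of this "valid" convolution might look like the following; real implementations add padding, strides, and multiple channels:

```python
import numpy as np

def conv2d_valid(I, K):
    """Valid 2D convolution (cross-correlation form): slides K over I with no padding."""
    H, W = I.shape
    kh, kw = K.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    S = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

# Example: a 3x3 vertical-edge kernel applied to a 5x5 image
I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1.0, 0.0, -1.0]] * 3)
print(conv2d_valid(I, K))   # 3x3 output feature map
```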
Pooling layers reduce the spatial dimensions of the data, helping to reduce overfitting and computation.
- Max Pooling:

  $$S(i, j) = \max_{(m, n) \in \mathcal{R}_{i,j}} I(m, n)$$

- Average Pooling:

  $$S(i, j) = \frac{1}{|\mathcal{R}_{i,j}|} \sum_{(m, n) \in \mathcal{R}_{i,j}} I(m, n)$$

Where $\mathcal{R}_{i,j}$ is the pooling window (e.g., a $2 \times 2$ region) associated with output position $(i, j)$.
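Both pooling variants can be sketched in NumPy as below, assuming a non-overlapping $p \times p$ window (the window size and stride are assumptions of this example):

```python
import numpy as np

def pool2d(I, p=2, mode="max"):
    """Non-overlapping p x p pooling; trims rows/columns that don't fill a full window."""
    H, W = I.shape
    H, W = H - H % p, W - W % p                     # crop to a multiple of p
    windows = I[:H, :W].reshape(H // p, p, W // p, p)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))                # average pooling

I = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(I, p=2, mode="max"))    # [[5, 7], [13, 15]]
print(pool2d(I, p=2, mode="avg"))    # [[2.5, 4.5], [10.5, 12.5]]
```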
While CPUs are optimized for low-latency sequential processing with complex control logic, GPUs are designed for massively parallel processing of large blocks of data, which makes them well suited to the dense matrix and element-wise operations that dominate neural network training and inference.
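As an illustration of the difference in programming model, rather than a statement about this project's implementation, the same matrix product can be written for the CPU with NumPy and, where a CUDA device and the optional CuPy package are available, for the GPU with an almost identical expression:

```python
import numpy as np

# CPU: NumPy executes the dense matrix product on the host.
a = np.random.standard_normal((1024, 1024))
b = np.random.standard_normal((1024, 1024))
c_cpu = a @ b

# GPU (illustrative; requires a CUDA device and the optional CuPy package):
# import cupy as cp
# a_gpu, b_gpu = cp.asarray(a), cp.asarray(b)
# c_gpu = a_gpu @ b_gpu            # same expression, executed on the GPU
# c = cp.asnumpy(c_gpu)            # copy the result back to host memory
```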