% Back-Propagation Neural Networks
\subsection{Back-Propagation Neural Networks}\label{subsec:back-propagation-neural-networks}
Gradient descent, also known as steepest descent, is a method for finding the minimum of a function; here it is used to find the optimal weights for domain-specific regression.
The weights are randomly initialized, and the gradient is subtracted to modify the weights $W$ while minimizing a loss function.
The loss function is constructed to measure the error of a neural network's forecasts on the training set.
The regularization term $\alpha \lVert W \rVert_2^2$ is included to reduce the norm of the weights, which results in better generalization at testing.
Given forecasts $\hat{y}$, real values $y$, and weights $W$, the loss function is expressed as
\begin{equation}
L(\hat{y},y,W)=\dfrac{1}{2} \lVert \hat{y}-y \rVert_{2}^{2} + \alpha \lVert W \rVert_{2}^{2}\label{eq:equation}
\end{equation}
The next step updates the weights by computing the gradient $\nabla L_{W_i}$ individually for each weight of the network.
The momentum term $\omega \Delta W_i$ is used to accelerate the rate of change of the weights.
\begin{equation}
W_{i+1}=W_i - \eta \nabla L_{W_{i}} + \omega \Delta W_i\label{eq:equation2}
\end{equation}
The gradient is used within the gradient descent algorithm to modify the weights and achieve a local or global minimum.
The point of this process is to determine how much error each individual weight contributed and modify them in the direction of the negative gradient proportionally to the amount of error each weight contributed.
Back-Propagation is a supervised learning method, and a neural network learns the data set iteratively by performing these two steps until the stopping criterion is met.
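As an illustration, the loss in Equation~\ref{eq:equation} and the momentum update in Equation~\ref{eq:equation2} can be sketched in Python on a toy linear-regression problem (a minimal NumPy sketch; the values chosen for the learning rate $\eta$, momentum $\omega$, and regularization strength $\alpha$ are illustrative assumptions, not prescribed by the method):

```python
import numpy as np

def loss(y_hat, y, W, alpha):
    """L(y_hat, y, W) = 1/2 ||y_hat - y||_2^2 + alpha ||W||_2^2"""
    return 0.5 * np.sum((y_hat - y) ** 2) + alpha * np.sum(W ** 2)

# Toy data generated by a known linear model y = X @ W_true.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
W_true = np.array([1.5, -2.0, 0.5])
y = X @ W_true

W = rng.normal(size=3)          # random weight initialization
eta, omega, alpha = 1e-3, 0.9, 1e-4
delta_W = np.zeros_like(W)      # previous update, for the momentum term

for _ in range(2000):
    y_hat = X @ W
    grad = X.T @ (y_hat - y) + 2 * alpha * W  # gradient of L w.r.t. W
    delta_W = -eta * grad + omega * delta_W   # W_{i+1} = W_i - eta * grad + omega * dW_i
    W = W + delta_W
```

After training, `W` recovers `W_true` up to a small regularization-induced shrinkage; the momentum term $\omega \Delta W_i$ reuses the previous update to smooth and accelerate descent across iterations.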
% Extreme Learning Machines
\subsection{Extreme Learning Machines}\label{subsec:extreme-learning-machines}
\citet{Huang:2006} proposed a new approach to neural networks called an Extreme Learning Machine.
The structure of the neural network follows the same structure as Back-Propagation neural networks where there is an input layer, hidden layer, and output layer.
The difference is that an Extreme Learning Machine's input-layer weights are initialized randomly and never modified.
This method creates independence between the input weights and the output weights of the neural network and allows one to solve an equation for the output weights by linear least squares.
The Extreme Learning Machine problem is posed as solving a single linear equation $H \beta = T$, where $H$ represents the hidden layer matrix, $\beta$ are the weights between the hidden and output layer, and $T$ are the target values.
The Moore-Penrose Generalized Inverse can be applied to the system to obtain the least-squares solution $\hat{\beta} = {H}^\dagger T$.
The solution is deterministic, models have significantly lower training times, and Extreme Learning Machines have the universal approximation property \citep{Huang2:2006}.
Extreme Learning Machines take advantage of the independent weight matrices separated by a hidden layer so that a minimum norm least-squares solution is found between the hidden and output layers; the $\beta $ weights with the smallest norm will generalize best for unseen data.
The result is a fast training algorithm, and the required number of hidden-layer nodes is always less than the number of training samples.
For $N$ arbitrary distinct samples $(x_i, t_i)$ where $x_i = \left[ x_{i_{1}}, x_{i_{2}}, \dots, x_{i_{n}} \right] \in R^n$ and $t_i = \left[ t_{i_{1}}, t_{i_{2}}, \dots, t_{i_{m}} \right] \in R^m$, the outputs of an Extreme Learning Machine with $\tilde{N}$ hidden neurons and activation function $g(x)$ are mathematically modeled as
\begin{equation}
\sum_{i=1}^{\tilde{N}} \beta_i g (w_i \cdot x_j + b_i)=o_j,\ j=1,2,\dots, N\label{eq:equation3}
\end{equation}
where $w_i = \left[ w_{i_{1}}, w_{i_{2}}, \dots, w_{i_{n}} \right]^T$ is the weight vector connecting the $i^{th}$ hidden neuron and the input neurons, $\beta_i=\left[ \beta_{i_{1}}, \beta_{i_{2}}, \dots, \beta_{i_{m}}\right]^T$ is the weight vector connecting the $i^{th}$ hidden neuron and the output neurons, and $b_i$ is the threshold of the $i^{th}$ hidden neuron.
A single hidden-layer feed-forward neural network with $\tilde{N}$ hidden neurons and activation function $g(x)$ can approximate these $N$ samples with zero error, which means there exist $\beta_i$, $w_i$, and $b_i$ such that
\begin{equation}
\sum_{ i = 1}^{\tilde{N}} \beta_i g (w_i \cdot x_j + b_i) = t_j,\ j=1,2,\dots, N\label{eq:equation4}
\end{equation}
The above $N$ equations are written compactly as $H \beta = T$.
The $i^{th}$ column of $H$ is the $i^{th}$ hidden neuron’s output with respect to the inputs $x_1, x_2, \dots, x_N$.
The matrix form of these equations is expressed as
\begin{equation}
H(w_{\tilde{N}}, b_{\tilde{N}}, x_N)=
\begin{bmatrix}
g(w_1 \cdot x_1 + b_1) & \dots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}} )\\
\vdots & \ddots & \vdots\\
g(w_1 \cdot x_N + b_1) & \dots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}} )
\end{bmatrix}\label{eq:equation5}
\end{equation}
\begin{equation}
\beta =
\begin{bmatrix}
\beta_{1}^T \\
\vdots \\
\beta_{\tilde{N}}^T
\end{bmatrix}\label{eq:equation6}
\end{equation}
\begin{equation}
T =
\begin{bmatrix}
t_{1}^T \\
\vdots \\
t_{N}^T
\end{bmatrix}\label{eq:equation7}
\end{equation}
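Putting these pieces together, the whole training procedure reduces to building $H$ once and solving a single least-squares problem. The following is a minimal NumPy sketch; the sigmoid activation, the hidden-layer size $\tilde{N} = 50$, and the toy target function are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: N samples, n inputs, m outputs.
N, n, m, N_hidden = 200, 4, 1, 50
X = rng.normal(size=(N, n))
T = np.sin(X.sum(axis=1, keepdims=True))   # target matrix (N x m)

# Step 1: random input weights w_i and biases b_i, fixed and never trained.
W_in = rng.normal(size=(n, N_hidden))
b = rng.normal(size=N_hidden)

# Step 2: hidden-layer output matrix H (N x N_hidden), here with g = sigmoid.
H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))

# Step 3: output weights via the Moore-Penrose pseudoinverse,
# the minimum-norm least-squares solution of H @ beta = T.
beta = np.linalg.pinv(H) @ T

# Forecasts on the training set.
T_hat = H @ beta
```

Because `np.linalg.pinv` returns the minimum-norm solution among all least-squares solutions, the computed $\beta$ is exactly the small-norm weight matrix that the generalization argument above relies on.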
% Methods