---
layout: page
title: Lecture 9 Recap - Neural Networks 2
mathjax: true
weight: 0
---
<section class="main-container text">
<div class="main">
<h4>Date: February 26, 2020 (<a href="https://docs.google.com/forms/d/e/1FAIpQLSf9Li4FodxBVj19HfAdd31ibKYrFtqCm9sEEEakWgURwZs97A/viewform" target="_blank">Concept Check</a>, <a href="https://docs.google.com/forms/d/e/1FAIpQLSf9Li4FodxBVj19HfAdd31ibKYrFtqCm9sEEEakWgURwZs97A/viewanalytics" target="_blank">Class Responses</a>, <a href="{{ site.baseurl }}/ccsolutions" target="_blank">Solutions</a>)</h4>
<h4>Relevant Textbook Sections: 4.4</h4>
<h4>Cube: Supervised, Discrete or Continuous, Nonprobabilistic</h4>
<br>
<h3>Lecture 9 Summary</h3>
<ul>
<li><a href="#recap9_1">Intro: More Context and Examples of Neural Networks</a></li>
<li><a href="#recap9_2">Optimizing the Neural Network</a></li>
<li><a href="#recap9_3">Detour: Vector Chain Rule</a></li>
<li><a href="#recap9_4">Let's Now Finish The Optimization</a></li>
<li><a href="#recap9_5">Let's Generalize Everything</a></li>
<li><a href="#recap9_6">Why does this all work?</a></li>
</ul>
<h2 id="recap9_1">Intro: More Context and Examples of Neural Networks</h2>
Last lecture, we described different neural network models and architectures. Today, we will focus on the optimization of neural networks. These derivations are going to be difficult, but they are important to understand.<br><br>
Also, before we move on, let's cover a few more examples of the use of neural networks. One thing we can now do is predict the aftershocks of earthquakes extremely efficiently using neural networks. Before, we had to run physics-based simulations to understand what to expect, but now we can instead train a neural network on the data, and it's about as accurate. Another example is the creation of blue LEDs. It's incredibly expensive to develop these lights, as you have to synthesize candidate molecules and test them, which can cost thousands of dollars. Now, we can predict which molecules are most likely to be effective at producing blue light. Basically, at a lower cost and in a shorter time, we'll be able to find a good molecule for this purpose and create blue LEDs much more effectively. This is a great example of the intersection of engineering and machine learning.<br><br>
<h2 id="recap9_2">Optimizing the Neural Network</h2>
As usual, just as we've done before, after we've picked our model class and our loss function (which we will do below), we need to optimize the loss function with respect to our parameters to solve for our final model.<br><br>
Ultimately, we want this specifically:<br><br>
$$\frac{\partial L_n}{\partial W}$$
To think through this, note that the loss function contains the $f_n$ term (see the diagram from last lecture for context if you don't remember what this is), and $f_n$ in turn depends on $W$, so we can ultimately take the gradient with respect to $W$. We can use the chain rule to depict this:<br><br>
$$\frac{\partial L_n}{\partial W} = \frac{\partial L_n}{\partial f_n} \frac{\partial f_n}{\partial W}$$
The reason we discuss this intuition first is to give context for what we're about to do below. As you can see, to calculate $\frac{\partial L_n}{\partial W}$ we are going to take many steps, and one step we can take first is to find $\frac{\partial L_n}{\partial f_n}$, which we will do below for two different example loss functions. It is also important to note that this part of the equation is independent of the other derivatives, so it is totally fair game to calculate it first.<br><br>
Also in lecture today, we will cover backpropagation, which relates very much to the above equation. To paraphrase Wikipedia (which I think has a great explanation): the concept of backpropagation is that we compute the gradient of the loss (with respect to the weights) in an efficient way, which makes training networks with multiple layers actually possible. Specifically, backpropagation computes the gradient of the loss function one layer at a time, starting from the final layer (and going backwards), reusing terms so that we don't have to repeat any calculations (for intermediate terms in the chain rule). If you've taken CS124, this is an example of dynamic programming!<br><br>
Even though in real life we will probably use packages that implement this for us rather than implement backpropagation ourselves, it's good to understand the intuition behind it so we can better use the tools available to us. By understanding the details, we will know why certain things are slower and others faster, etc.<br><br>
<h4>Loss Function (if regression)</h4>
First, let's define an example loss function that we might use for regression (note: we can always choose other loss functions for regression; it doesn't have to be least squares). Below, we have the least squares loss. Just a note: we will be as general as possible, and hence are splitting the notes into this regression/classification format. Also note that where we used to put $\hat{y}$ in linear regression, we now put $f_n$.
$$L(w) = \frac{1}{2} \sum_n (y_n - f_n)^2$$
Let us take the derivative:<br><br>
$$\frac{\partial L_n}{\partial f_n} = (y_n - f_n)(-1)$$
where $L_n = \frac{1}{2}(y_n - f_n)^2$ is the per-example term of the sum above.<br><br>
Note: If we swapped in another loss function, the only part of the original expression $\frac{\partial L_n}{\partial W}$ that changes is this specific piece, $\frac{\partial L_n}{\partial f_n}$.
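<br><br>
For concreteness, here is a minimal NumPy sketch (the function and variable names are made up for illustration) of this least-squares loss and its per-example derivative, checked against a finite difference:
<pre><code>import numpy as np

def squared_loss(y, f):
    """L = 1/2 * sum_n (y_n - f_n)^2"""
    return 0.5 * np.sum((y - f) ** 2)

def dL_df(y, f):
    """dL_n/df_n = (y_n - f_n)(-1), one entry per example."""
    return -(y - f)

# toy check against a finite difference on one entry
y = np.array([1.0, 2.0, 3.0])
f = np.array([0.5, 2.5, 2.0])
eps = 1e-6
f_plus = f.copy(); f_plus[0] += eps
numeric = (squared_loss(y, f_plus) - squared_loss(y, f)) / eps
print(dL_df(y, f)[0], numeric)  # both approximately -0.5
</code></pre>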
<h4>Loss Function (if classification)</h4>
As mentioned before, we are being as general as possible, so we also write a loss function for classification here, as another example. Let's pick the logistic loss as our example.<br><br>
Remember first that for our binary classification, we define the following:<br><br>
$$p(y=1|x) = \sigma(f_n)$$
$$p(y=0|x) = 1 - \sigma(f_n)$$
Now, we can define the logistic loss function using the terms above:<br><br>
$$L(w) = \sum_n y_n \log p(y=1|x) + (1 - y_n) \log p(y=0|x)$$
$$L(w) = \sum_n y_n \log \sigma(f_n) + (1 - y_n) \log (1 - \sigma(f_n))$$
Also remember the following helpful tip for the derivative:<br><br>
$$\frac{\partial\sigma(z)}{\partial z} = \sigma(z)(1 - \sigma(z))$$
Let us take the derivative:<br><br>
$$\frac{\partial L_n}{\partial f_n} = y_n \frac{1}{\sigma(f_n)} \sigma(f_n) (1 - \sigma(f_n)) + (1 - y_n) \frac{1}{1 - \sigma(f_n)} \sigma(f_n)(1 - \sigma(f_n))(-1)$$
We can simplify:<br><br>
$$\frac{\partial L_n}{\partial f_n} = y_n (1 - \sigma(f_n)) - (1 - y_n) \sigma(f_n)$$
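As a sanity check, here is a small NumPy sketch (again with made-up names) that computes this simplified gradient and compares it to a finite difference of the loss above:
<pre><code>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y, f):
    """L = sum_n y_n log sigma(f_n) + (1 - y_n) log(1 - sigma(f_n))"""
    s = sigmoid(f)
    return np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

def dL_df(y, f):
    """Simplified gradient: y_n (1 - sigma(f_n)) - (1 - y_n) sigma(f_n)."""
    s = sigmoid(f)
    return y * (1 - s) - (1 - y) * s

y = np.array([1.0, 0.0, 1.0])
f = np.array([0.3, -1.2, 2.0])
eps = 1e-6
f_plus = f.copy(); f_plus[1] += eps
numeric = (logistic_loss(y, f_plus) - logistic_loss(y, f)) / eps
print(dL_df(y, f)[1], numeric)  # should agree to several decimals
</code></pre>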
<h2 id="recap9_3">Detour: Vector Chain Rule</h2>
We will take a slight detour now in order to gain the tools we need to do the back propagation.
$$y = f(u) = f(g(h(x)))$$
$$u = g(v) = g(h(x))$$
$$v = h(x)$$
<h4>All Scalars</h4>
If we had all scalars, and we wanted to look for: $\frac{\partial f}{\partial x}$<br><br>
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial v} \frac{\partial v}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial g}{\partial v}\frac{\partial h}{\partial x}$$<br><br>
<h4>If x (size $D$) was a vector </h4>
$$\nabla_\mathbf{x}f = \frac{\partial f}{\partial u} \frac{\partial u}{\partial v} \nabla_\mathbf{x}v$$
Where $\frac{\partial f}{\partial u}$ and $\frac{\partial u}{\partial v}$ are scalars, and
$\nabla_\mathbf{x} v$ is a $1 \times D$ vector: it captures how each dimension of $\mathbf{x}$ affects the scalar $v$.
<br><br>
<strong>Note:</strong> We are using the <em>numerator layout notation</em> (sometimes known as the <em>Jacobian formulation</em>),
where $\mathbf{y}$ and $\mathbf{x}$ are size $M$ and $N$ vectors respectively and
$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is an $M\times N$ matrix such that
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_N} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_M}{\partial x_1} & \frac{\partial y_M}{\partial x_2} & \cdots & \frac{\partial y_M}{\partial x_N}
\end{pmatrix}$$
$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is known as the Jacobian matrix and represented as $\mathbb{J}_\mathbf{x}[\mathbf{y}]$.
<br><br>
<h4>If x (size $D$) and v (size $J$) were both vectors</h4>
$$\nabla_\mathbf{x} f = \frac{\partial f}{\partial u} \nabla_\mathbf{v}u \mathbb{J}_\mathbf{x}[\mathbf{v}]$$
$\frac{\partial f}{\partial u}$ is a scalar, $\nabla_\mathbf{v}u$ is a $1 \times J$ vector, and
$\mathbb{J}_\mathbf{x}[\mathbf{v}]$ is a $J \times D$ matrix.
<br>
$\mathbb{J}_\mathbf{x}[\mathbf{v}]$ measures how each
dimension of $\mathbf{x}$ affects each dimension of $\mathbf{v}$, and $\nabla_\mathbf{v}u$ measures how each
dimension of $\mathbf{v}$ affects the scalar $u$.
<br>
Chaining them,
$\nabla_\mathbf{v}u \mathbb{J}_\mathbf{x}[\mathbf{v}]$ (a $1 \times D$ vector) measures how each
dimension of $\mathbf{x}$ affects the scalar $u$.
<br>
Consequently, $\frac{\partial f}{\partial u} \nabla_\mathbf{v}u \mathbb{J}_\mathbf{x}[\mathbf{v}]$ is a vector of
how each dimension of $\mathbf{x}$ changes the scalar $f$, through $u$ and all dimensions of $\mathbf{v}$.
<br><br>
<h4>If x (size $D$) and v (size $J$) and u (size $J^\prime$) were all vectors</h4>
$$\nabla_\mathbf{x} f = \nabla_\mathbf{u} f \mathbb{J}_\mathbf{v}[\mathbf{u}] \mathbb{J}_\mathbf{x}[\mathbf{v}]$$
$\nabla_\mathbf{u} f$ is a $1 \times J^\prime$ vector, $\mathbb{J}_\mathbf{v}[\mathbf{u}]$ is a $J^\prime \times J$ matrix,
$\mathbb{J}_\mathbf{x}[\mathbf{v}]$ is a $J\times D$ matrix.
We still get a vector of how each $\mathbf{x}$ dimension changes scalar $f$, through all dimensions of $\mathbf{v}$
and $\mathbf{u}$.
<br><br>
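As a quick shape check of this last case, the following NumPy sketch (the sizes $D$, $J$, $J^\prime$ and the arrays are made up) chains a $1 \times J^\prime$ gradient row with a $J^\prime \times J$ and a $J \times D$ Jacobian and recovers a $1 \times D$ row, as claimed:
<pre><code>import numpy as np

D, J, Jp = 4, 3, 2                  # made-up sizes of x, v, u
grad_u_f = np.random.randn(1, Jp)   # nabla_u f:  1 x J'
J_v_u = np.random.randn(Jp, J)      # Jacobian of u w.r.t. v:  J' x J
J_x_v = np.random.randn(J, D)       # Jacobian of v w.r.t. x:  J x D

grad_x_f = grad_u_f @ J_v_u @ J_x_v # nabla_x f:  1 x D row vector
print(grad_x_f.shape)               # (1, 4)
</code></pre>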
<h2 id="recap9_4">Let's Now Finish The Optimization</h2>
As a refresher, from before we have:
$$\frac{\partial L_n}{\partial \mathbf{w}} = \frac{\partial L_n}{\partial f_n} \frac{\partial f_n}{\partial \mathbf{w}}$$
We already calculated $\frac{\partial L_n}{\partial f_n}$ before, so we need to now calculate $\frac{\partial f_n}{\partial \mathbf{w}}$.
Let's say our $f_n = \mathbf{w}^T\mathbf{\phi_n} + b$. If so, that means we have:
$$\frac{\partial f_n}{\partial \mathbf{w}} = \mathbf{\phi_n}$$
<h4>Example: Let's Complete the Full Expression for the Regression Case</h4>
From before for the regression case, we have $\frac{\partial L_n}{\partial f_n} = (y_n - f_n)(-1)$, and
now we have $\frac{\partial f_n}{\partial \mathbf{w}} = \mathbf{\phi_n}$, so combined we have:<br><br>
$$\frac{\partial L_n}{\partial \mathbf{w}} = (y_n - f_n)(-1) \mathbf{\phi_n}$$
<br><br>
Notice that we need the value $f_n$ to compute the gradient.
<br>
<em>How will we get this term?</em> We can first compute $\mathbf{x} \rightarrow \mathbf{\phi_n} \rightarrow f_n$ and then use
that value of $f_n$ in the gradient.
<br><br>
Recall that $\mathbf{\phi_n} = \sigma (\mathbf{W}^1\mathbf{x} + \mathbf{b}^1)$. <em>How can we get $\nabla_\mathbf{W}\mathcal{L}_n$</em>?
<br><br>
$$\nabla_\mathbf{W}\mathcal{L}_n = \frac{\partial L_n}{\partial f_n} \nabla_\mathbf{\phi_n}f_n \mathbb{J}_\mathbf{W}[\mathbf{\phi_n}]$$
<!-- Here we are required to take the gradient of a vector $\mathbf{\phi_n}$ with respect to a matrix $\mathbf{W}$.-->
<!-- There isn't a notation for vectors-by-matrices derivatives that is widely agreed upon. We will do this by first flattening-->
<!-- $\mathbf{W}$ from a $J \times D$ matrix into a $J*D$ vector when thinking about the derivative-->
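To tie these pieces together, here is a minimal NumPy sketch of a hypothetical one-hidden-layer regression network (the shapes and values are made up): a forward pass $\mathbf{x} \rightarrow \mathbf{\phi_n} \rightarrow f_n$, followed by the gradients derived above, with $\nabla_\mathbf{W}\mathcal{L}_n$ obtained by chaining through $\mathbf{\phi_n}$:
<pre><code>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, J = 3, 4                       # made-up sizes
x  = rng.standard_normal(D)       # one input x_n
y  = 1.5                          # its target y_n
W1 = rng.standard_normal((J, D))  # first-layer weights W^1
b1 = rng.standard_normal(J)       # first-layer bias b^1
w  = rng.standard_normal(J)       # output weights w
b  = 0.1                          # output bias b

# forward pass: x -> phi_n -> f_n -> L_n
phi = sigmoid(W1 @ x + b1)        # phi_n = sigma(W^1 x + b^1)
f   = w @ phi + b                 # f_n = w^T phi_n + b
L   = 0.5 * (y - f) ** 2

# backward pass, reusing the stored forward values
dL_df   = -(y - f)                    # dL_n/df_n (scalar)
grad_w  = dL_df * phi                 # dL_n/dw   (size J)
grad_b  = dL_df                       # dL_n/db
delta   = dL_df * w * phi * (1 - phi) # dL_n/d(W^1 x + b^1), size J
grad_W1 = np.outer(delta, x)          # dL_n/dW^1 (J x D)
grad_b1 = delta                       # dL_n/db^1

# finite-difference check on one entry of W^1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Lp = 0.5 * (y - (w @ sigmoid(W1p @ x + b1) + b)) ** 2
print(grad_W1[0, 0], (Lp - L) / eps)  # should match closely
</code></pre>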
<h2 id="recap9_5">Let's Generalize Everything</h2>
The process of computing gradients for optimizing neural networks is as follows:
<ol>
<li>
Compute $\mathcal{L}(\mathbf{\phi}^{L}(\mathbf{\phi}^{L-1}(\cdots \mathbf{\phi}^{1}(\mathbf{x}))))$ by passing $\mathbf{x}$
through the network $\mathbf{x} \rightarrow \mathbf{\phi}^{1} \rightarrow \cdots \rightarrow
\mathbf{\phi}^{L-1} \rightarrow \mathbf{\phi}^{L} \rightarrow \mathcal{L}$ (this is called a forward pass) and store all
$\mathbf{\phi}^\ell$.
</li>
<li>
Perform the chain-rule backwards:
<ul>
<li>
$$\frac{\partial \mathcal{L}}{\partial \mathbf{\phi}^\ell} = \nabla_{\mathbf{\phi}^L}\mathcal{L}
\mathbb{J}_{\mathbf{\phi}^{L-1}}[\mathbf{\phi}^{L}] \cdots \mathbb{J}_{\mathbf{\phi}^{\ell}}[\mathbf{\phi}^{\ell + 1}]$$
</li>
<li>
Compute the parameter gradients
$$\begin{align*}
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^\ell} &= \nabla_{\mathbf{\phi}^L}\mathcal{L}
\mathbb{J}_{\mathbf{\phi}^{L-1}}[\mathbf{\phi}^{L}] \cdots \mathbb{J}_{\mathbf{\phi}^{\ell}}[\mathbf{\phi}^{\ell + 1}]
\mathbb{J}_{\mathbf{W}^{\ell}}[\mathbf{\phi}^{\ell}] \\
&= \frac{\partial \mathcal{L}}{\partial \mathbf{\phi}^\ell} \frac{\partial \mathbf{\phi}^\ell}{\partial \mathbf{W}^\ell}
\end{align*}$$
</li>
</ul>
</li>
</ol>
The following picture shows the steps:
<div class="text-center">
<img src="{{ site.baseurl }}/images/recap9_1.png" style="width:60%" alt="Reverse-mode AutoDiff Diagram">
</div>
<br><br>
<strong>Note: </strong> This is called "reverse-mode" autodiff. This is good when the output ($y$) is small and there
are many parameters (all the $W$s). It allows for efficient reuse of computed values.
<br><br>
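To make these two steps concrete, here is a minimal sketch of the reverse-mode pass through a hypothetical stack of sigmoid layers (the sizes are made up, and unlike the model above the output layer is itself a sigmoid): the forward pass stores every $\mathbf{\phi}^\ell$, and the backward pass carries the running row vector $\frac{\partial \mathcal{L}}{\partial \mathbf{\phi}^\ell}$, peeling off one Jacobian per layer to produce each $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^\ell}$:
<pre><code>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
sizes = [3, 5, 4, 1]                        # made-up layer widths; last output is f_n
Ws = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
bs = [rng.standard_normal(sizes[l + 1]) for l in range(len(sizes) - 1)]
x, y = rng.standard_normal(sizes[0]), 0.7

# 1) forward pass: store every phi^l
phis = [x]
for W, b in zip(Ws, bs):
    phis.append(sigmoid(W @ phis[-1] + b))
f = phis[-1][0]
L = 0.5 * (y - f) ** 2

# 2) backward pass: carry dL/dphi^l backwards, reusing the stored phis
grads_W = [None] * len(Ws)
dL_dphi = np.array([-(y - f)])              # dL/dphi^L (here phi^L is f_n itself)
for l in reversed(range(len(Ws))):
    phi = phis[l + 1]
    delta = dL_dphi * phi * (1 - phi)       # fold in the (diagonal) sigmoid Jacobian
    grads_W[l] = np.outer(delta, phis[l])   # dL/dW^l
    dL_dphi = Ws[l].T @ delta               # dL/dphi^(l-1), passed further back

print([g.shape for g in grads_W])           # one gradient per layer
</code></pre>
<br><br>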
We can also do "forward-mode" diff. Suppose we wanted just a derivative/sensitivity test on
$\frac{\partial \mathcal{L}_n}{\partial x_{nd}}$. Then we don't need all the $\frac{\partial \mathbf{\phi}}{\partial W_{jd^\prime}}$
for all $d^\prime \neq d$.
<br>
Forward mode goes from the variable $x_{nd}$ to $\mathcal{L}_n$ rather than from $\mathcal{L}_n$ to $x_{nd}$.
<br>
We compute $\frac{\partial \mathbf{\phi}^1}{\partial x_{nd}}$, then $\frac{\partial \mathbf{\phi}^2}{\partial \mathbf{\phi}^1}$
all the way up.
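A matching sketch of the forward-mode direction (same hypothetical sigmoid stack, made-up sizes): we push the unit tangent $\frac{\partial \mathbf{x}}{\partial x_{nd}}$ forward through the layers, one Jacobian-vector product at a time, and read off $\frac{\partial \mathcal{L}_n}{\partial x_{nd}}$ at the end without ever forming gradients with respect to the weights:
<pre><code>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
sizes = [3, 5, 4, 1]                              # made-up layer widths
Ws = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
bs = [rng.standard_normal(sizes[l + 1]) for l in range(len(sizes) - 1)]
x, y, d = rng.standard_normal(sizes[0]), 0.7, 1   # we want dL_n/dx_d for d = 1

phi = x
tangent = np.zeros_like(x); tangent[d] = 1.0      # dx/dx_d (unit vector)
for W, b in zip(Ws, bs):
    phi = sigmoid(W @ phi + b)
    # forward chain rule: d(phi^l)/dx_d = diag(sigma') W d(phi^(l-1))/dx_d
    tangent = phi * (1 - phi) * (W @ tangent)

f = phi[0]
dL_dxd = -(y - f) * tangent[0]                    # dL_n/df_n * df_n/dx_d
print(dL_dxd)
</code></pre>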
<h2 id="recap9_6">Why does this all work?</h2>
<ol>
<li>GPUs allow fast computations of linear algebra</li>
<li>We now have lots of data, and lots of data avoids overfitting</li>
<li>Practicalities
<ul>
<li>Theory is lagging, but the belief is that over-parameterization makes gradient descent easier (gradient descent tends to get stuck in simpler, less over-parameterized models)</li>
<li>Tricks: perturbing inputs to avoid overfitting, early stopping (stop when validation loss starts to increase), and L1 and L2 regularization</li>
</ul>
</li>
</ol>
</div>
</section>