Much of machine learning is about minimizing a \term{cost function} (also called an \term{objective function} in the optimization community), which is a scalar function of several variables that typically measures how poorly our model fits the data we have.
\subsection{Extrema}
Optimization is about finding \term{extrema}, which depending on the application could be minima or maxima.
When defining extrema, it is necessary to consider the set of inputs over which we're optimizing.
This set $\calX \subseteq \R^d$ is called the \term{feasible set}.
If $\calX$ is the entire domain of the function being optimized (as it often will be for our purposes), we say that the problem is \term{unconstrained}.
Otherwise the problem is \term{constrained} and may be much harder to solve, depending on the nature of the feasible set.
Suppose $f : \R^d \to \R$.
A point $\x$ is said to be a \term{local minimum} (resp. \term{local maximum}) of $f$ in $\calX$ if $f(\x) \leq f(\y)$ (resp. $f(\x) \geq f(\y)$) for all $\y$ in some neighborhood $N \subseteq \calX$ about $\x$.\footnote{
A \textbf{neighborhood} about $\x$ is an open set which contains $\x$.
}
Furthermore, if $f(\x) \leq f(\y)$ for all $\y \in \calX$, then $\x$ is a \term{global minimum} of $f$ in $\calX$ (similarly for global maximum).
If the phrase ``in $\calX$'' is unclear from context, assume we are optimizing over the whole domain of the function.
The qualifier \term{strict} (as in e.g. a strict local minimum) means that the inequality in the definition holds strictly ($<$ or $>$) for all $\y \neq \x$, with equality not allowed.
This indicates that the extremum is unique within some neighborhood.
Observe that maximizing a function $f$ is equivalent to minimizing $-f$, so optimization problems are typically phrased in terms of minimization without loss of generality.
This convention (which we follow here) eliminates the need to discuss minimization and maximization separately.
\subsection{Gradients}
The single most important concept from calculus in the context of machine learning is the \term{gradient}.
Gradients generalize derivatives to scalar functions of several variables.
The gradient of $f : \R^d \to \R$, denoted $\nabla f$, is given by
\[\nabla f = \matlit{\pdv{f}{x_1} \\ \vdots \\ \pdv{f}{x_d}}
\tab\text{i.e.}\tab
[\nabla f]_i = \pdv{f}{x_i}\]
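As a quick example, for $f(\x) = \x\tran\x = \sum_{i=1}^d x_i^2$ we have $\pdv{f}{x_i} = 2x_i$ for each $i$, so
\[\nabla f(\x) = 2\x\]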
Gradients have the following very important property: $\nabla f(\x)$ points in the direction of \term{steepest ascent} from $\x$.
Similarly, $-\nabla f(\x)$ points in the direction of \term{steepest descent} from $\x$.
We will use this fact frequently when iteratively minimizing a function via \term{gradient descent}.
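As a minimal illustration, the following Python sketch implements the basic update $\x \leftarrow \x - \alpha\nabla f(\x)$ for the example above; the fixed step size and iteration count are illustrative choices, not recommendations.
\begin{verbatim}
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, num_iters=100):
    """Minimize f by repeatedly stepping along the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)   # -grad_f(x) is a descent direction
    return x

# Example: f(x) = x^T x has gradient 2x and a unique minimum at the origin.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))  # approximately [0, 0]
\end{verbatim}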
\subsection{The Jacobian}
The \term{Jacobian} of $f : \R^n \to \R^m$ is a matrix of first-order partial derivatives:
\[\mat{J}_f = \matlit{
\pdv{f_1}{x_1} & \hdots & \pdv{f_1}{x_n} \\
\vdots & \ddots & \vdots \\
\pdv{f_m}{x_1} & \hdots & \pdv{f_m}{x_n}}
\tab\text{i.e.}\tab
[\mat{J}_f]_{ij} = \pdv{f_i}{x_j}\]
Note the special case $m = 1$, where $\nabla f = \mat{J}_f\tran$.
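For instance, if $f : \R^2 \to \R^2$ is given by $f(x_1, x_2) = (x_1^2 x_2, \; 5x_1 + \sin x_2)$, then
\[\mat{J}_f = \matlit{2x_1 x_2 & x_1^2 \\ 5 & \cos x_2}\]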
\subsection{The Hessian}
The \term{Hessian} matrix of $f : \R^d \to \R$ is a matrix of second-order partial derivatives:
\[\nabla^2 f = \matlit{
\pdv[2]{f}{x_1} & \hdots & \pdv{f}{x_1}{x_d} \\
\vdots & \ddots & \vdots \\
\pdv{f}{x_d}{x_1} & \hdots & \pdv[2]{f}{x_d}}
\tab\text{i.e.}\tab
[\nabla^2 f]_{ij} = {\pdv{f}{x_i}{x_j}}\]
Recall that if the partial derivatives are continuous, the order of differentiation can be interchanged (Clairaut's theorem), so the Hessian matrix will be symmetric.
This will typically be the case for differentiable functions that we work with.
The Hessian is used in some optimization algorithms such as Newton's method.
It is expensive to calculate but can drastically reduce the number of iterations needed to converge to a local minimum by providing information about the curvature of $f$.
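For reference, each Newton iteration uses this curvature information by taking the step
\[\x_{k+1} = \x_k - [\nabla^2 f(\x_k)]^{-1}\nabla f(\x_k)\]
which reaches the minimizer in a single step when $f$ is a quadratic with positive definite Hessian.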
\subsection{Matrix calculus}
Since a lot of optimization reduces to finding points where the gradient vanishes, it is useful to have differentiation rules for matrix and vector expressions.
We give some common rules here.
Probably the two most important for our purposes are
\begin{align*}
\nabla_\x &(\vec{a}\tran\x) = \vec{a} \\
\nabla_\x &(\x\tran\A\x) = (\A + \A\tran)\x
\end{align*}
Note that this second rule is defined only if $\A$ is square.
Furthermore, if $\A$ is symmetric, we can simplify the result to $2\A\x$.
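As an example of these rules in action, consider the least-squares cost $f(\x) = \|\A\x - \vec{b}\|_2^2$. Expanding,
\[f(\x) = \x\tran\A\tran\A\x - 2\vec{b}\tran\A\x + \vec{b}\tran\vec{b}\]
and applying the two rules above (noting that $\A\tran\A$ is symmetric) gives
\[\nabla_\x f(\x) = 2\A\tran\A\x - 2\A\tran\vec{b} = 2\A\tran(\A\x - \vec{b})\]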
\subsubsection{The chain rule}
Most functions that we wish to optimize are not completely arbitrary functions, but rather are composed of simpler functions which we know how to handle.
The chain rule gives us a way to calculate derivatives for a composite function in terms of the derivatives of the simpler functions that make it up.
The chain rule from single-variable calculus should be familiar:
\[(f \circ g)'(x) = f'(g(x))g'(x)\]
where $\circ$ denotes function composition.
There is a natural generalization of this rule to multivariate functions.
\begin{proposition}
Suppose $f : \R^m \to \R^k$ and $g : \R^n \to \R^m$. Then $f \circ g : \R^n \to \R^k$ and
\[\mat{J}_{f \circ g}(\x) = \mat{J}_f(g(\x))\mat{J}_g(\x)\]
\end{proposition}
In the special case $k = 1$ we have the following corollary since $\nabla f = \mat{J}_f\tran$.
\begin{corollary}
Suppose $f : \R^m \to \R$ and $g : \R^n \to \R^m$. Then $f \circ g : \R^n \to \R$ and
\[\nabla (f \circ g)(\x) = \mat{J}_g(\x)\tran \nabla f(g(\x))\]
\end{corollary}
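For example, take $g(\x) = \A\x - \vec{b}$ and $f(\vec{u}) = \|\vec{u}\|_2^2$, so that $f \circ g$ is the least-squares cost from before. Then $\mat{J}_g(\x) = \A$ and $\nabla f(\vec{u}) = 2\vec{u}$, so the corollary gives
\[\nabla (f \circ g)(\x) = \mat{J}_g(\x)\tran \nabla f(g(\x)) = 2\A\tran(\A\x - \vec{b})\]
in agreement with the direct computation above.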
\subsection{Taylor's theorem}
Taylor's theorem has natural generalizations to functions of more than one variable.
We give the version presented in \cite{numopt}.
\begin{theorem}
(Taylor's theorem)
Suppose $f : \R^d \to \R$ is continuously differentiable, and let $\h \in \R^d$.
Then there exists $t \in (0,1)$ such that
\[f(\x + \h) = f(\x) + \nabla f(\x + t\h)\tran\h\]
Furthermore, if $f$ is twice continuously differentiable, then
\[\nabla f(\x + \h) = \nabla f(\x) + \int_0^1 \nabla^2 f(\x + t\h)\h \dd{t}\]
and there exists $t \in (0,1)$ such that
\[f(\x + \h) = f(\x) + \nabla f(\x)\tran\h + \frac{1}{2}\h\tran\nabla^2f(\x+t\h)\h\]
\end{theorem}
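Dropping the remainder terms gives the local models used by optimization algorithms in practice: to first order, $f(\x+\h) \approx f(\x) + \nabla f(\x)\tran\h$, and to second order,
\[f(\x + \h) \approx f(\x) + \nabla f(\x)\tran\h + \frac{1}{2}\h\tran\nabla^2 f(\x)\h\]
the latter being the quadratic model minimized by Newton's method at each iteration.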
This theorem is used in proofs about conditions for local minima of unconstrained optimization problems.
Some of the most important results are given in the next section.
\subsection{Conditions for local minima}
\begin{proposition}
If $\x^*$ is a local minimum of $f$ and $f$ is continuously differentiable in a neighborhood of $\x^*$, then $\nabla f(\x^*) = \vec{0}$.
\end{proposition}
\begin{proof}
Let $\x^*$ be a local minimum of $f$, and suppose towards a contradiction that $\nabla f(\x^*) \neq \vec{0}$.
Let $\h = -\nabla f(\x^*)$, noting that by the continuity of $\nabla f$ we have
\[\lim_{t \to 0} -\nabla f(\x^* + t\h) = -\nabla f(\x^*) = \h\]
Hence
\[\lim_{t \to 0} \h\tran\nabla f(\x^* + t\h) = \h\tran\nabla f(\x^*) = -\|\h\|_2^2 < 0\]
Thus there exists $T > 0$ such that $\h\tran\nabla f(\x^* + t\h) < 0$ for all $t \in [0,T]$.
Now we apply Taylor's theorem: for any $t \in (0,T]$, there exists $t' \in (0,t)$ such that
\[f(\x^* + t\h) = f(\x^*) + t\h\tran \nabla f(\x^* + t'\h) < f(\x^*)\]
whence it follows that $\x^*$ is not a local minimum, a contradiction.
Hence $\nabla f(\x^*) = \vec{0}$.
\end{proof}
The proof shows us why a vanishing gradient is necessary for an extremum: if $\nabla f(\x)$ is nonzero, there always exists a sufficiently small step $\alpha > 0$ such that $f(\x - \alpha\nabla f(\x)) < f(\x)$.
For this reason, $-\nabla f(\x)$ is called a \term{descent direction}.
Points where the gradient vanishes are called \term{stationary points}.
Note that not all stationary points are extrema.
Consider $f : \R^2 \to \R$ given by $f(x,y) = x^2 - y^2$.
We have $\nabla f(\vec{0}) = \vec{0}$, but the point $\vec{0}$ is the minimum along the line $y = 0$ and the maximum along the line $x = 0$.
Thus it is neither a local minimum nor a local maximum of $f$.
Points such as these, where the gradient vanishes but there is no local extremum, are called \term{saddle points}.
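For this example the Hessian is constant,
\[\nabla^2 f(x,y) = \matlit{2 & 0 \\ 0 & -2}\]
with one positive and one negative eigenvalue; the second-order conditions developed below make precise how such indefiniteness rules out a local extremum.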
We have seen that first-order information (i.e. the gradient) is insufficient to characterize local minima.
But we can say more with second-order information (i.e. the Hessian).
First we prove a necessary second-order condition for local minima.
\begin{proposition}
If $\x^*$ is a local minimum of $f$ and $f$ is twice continuously differentiable in a neighborhood of $\x^*$, then $\nabla^2 f(\x^*)$ is positive semi-definite.
\end{proposition}
\begin{proof}
Let $\x^*$ be a local minimum of $f$, and suppose towards a contradiction that $\nabla^2 f(\x^*)$ is not positive semi-definite.
Let $\h$ be such that $\h\tran\nabla^2 f(\x^*)\h < 0$, noting that by the continuity of $\nabla^2 f$ we have
\[\lim_{t \to 0} \nabla^2 f(\x^* + t\h) = \nabla^2 f(\x^*)\]
Hence
\[\lim_{t \to 0} \h\tran\nabla^2 f(\x^* + t\h)\h = \h\tran\nabla^2 f(\x^*)\h < 0\]
Thus there exists $T > 0$ such that $\h\tran\nabla^2 f(\x^* + t\h)\h < 0$ for all $t \in [0,T]$.
Now we apply Taylor's theorem: for any $t \in (0,T]$, there exists $t' \in (0,t)$ such that
\[f(\x^* + t\h) = f(\x^*) + \underbrace{t\h\tran\nabla f(\x^*)}_0 + \frac{1}{2}t^2\h\tran\nabla^2 f(\x^* + t'\h)\h < f(\x^*)\]
where the middle term vanishes because $\nabla f(\x^*) = \vec{0}$ by the previous result.
It follows that $\x^*$ is not a local minimum, a contradiction.
Hence $\nabla^2 f(\x^*)$ is positive semi-definite.
\end{proof}
Now we give sufficient conditions for local minima.
\begin{proposition}
Suppose $f$ is twice continuously differentiable with $\nabla^2 f$ positive semi-definite in a neighborhood of $\x^*$, and that $\nabla f(\x^*) = \vec{0}$.
Then $\x^*$ is a local minimum of $f$.
Furthermore if $\nabla^2 f(\x^*)$ is positive definite, then $\x^*$ is a strict local minimum.
\end{proposition}
\begin{proof}
Let $B$ be an open ball of radius $r > 0$ centered at $\x^*$ which is contained in the neighborhood.
Applying Taylor's theorem, we have that for any $\h$ with $\|\h\|_2 < r$, there exists $t \in (0,1)$ such that
\[f(\x^* + \h) = f(\x^*) + \underbrace{\h\tran\nabla f(\x^*)}_0 + \frac{1}{2}\h\tran\nabla^2 f(\x^* + t\h)\h \geq f(\x^*)\]
The last inequality holds because $\nabla^2 f(\x^* + t\h)$ is positive semi-definite (since $\|t\h\|_2 = t\|\h\|_2 < \|\h\|_2 < r$), so $\h\tran\nabla^2 f(\x^* + t\h)\h \geq 0$.
Since $f(\x^*) \leq f(\x^* + \h)$ for all directions $\h$ with $\|\h\|_2 < r$, we conclude that $\x^*$ is a local minimum.
Now further suppose that $\nabla^2 f(\x^*)$ is strictly positive definite.
Since the Hessian is continuous we can choose another ball $B'$ with radius $r' > 0$ centered at $\x^*$ such that $\nabla^2 f(\x)$ is positive definite for all $\x \in B'$.
Then following the same argument as above (except with a strict inequality now since the Hessian is positive definite) we have $f(\x^* + \h) > f(\x^*)$ for all $\h$ with $0 < \|\h\|_2 < r'$.
Hence $\x^*$ is a strict local minimum.
\end{proof}
Note that, perhaps counterintuitively, the conditions $\nabla f(\x^*) = \vec{0}$ and $\nabla^2 f(\x^*)$ positive semi-definite are not enough to guarantee a local minimum at $\x^*$!
Consider the function $f(x) = x^3$.
We have $f'(0) = 0$ and $f''(0) = 0$ (so the Hessian, which in this case is the $1 \times 1$ matrix $\matlit{0}$, is positive semi-definite).
But $f$ has a saddle point at $x = 0$.
The function $f(x) = -x^4$ is an even worse offender -- it has the same gradient and Hessian at $x = 0$, but $x = 0$ is a strict local maximum for this function!
For these reasons we require that the Hessian remains positive semi-definite as long as we are close to $\x^*$.
Unfortunately, this condition is not practical to check computationally, but in some cases we can verify it analytically (usually by showing that $\nabla^2 f(\x)$ is p.s.d. for all $\x \in \R^d$).
Also, if $\nabla^2 f(\x^*)$ is strictly positive definite, the continuity of $\nabla^2 f$ implies this condition, so we don't have to worry.
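As an example where this analytic check works, consider again $f(\x) = \|\A\x - \vec{b}\|_2^2$. We computed $\nabla f(\x) = 2\A\tran(\A\x - \vec{b})$ earlier, and differentiating once more gives $\nabla^2 f(\x) = 2\A\tran\A$, which is positive semi-definite for every $\x$ since $\h\tran\A\tran\A\h = \|\A\h\|_2^2 \geq 0$. Hence any stationary point, i.e. any solution of $\A\tran\A\x = \A\tran\vec{b}$, is a local minimum by the sufficient condition above.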
\subsection{Convexity}
\input{cs189-convexity.tex}
\subsection{Orthogonal projections}
We now consider a kind of optimization problem that is particularly well-understood and can often be solved in closed form: given some point $\x$ in an inner product space $V$, find the closest point to $\x$ in a subspace $S$ of $V$.
This process is referred to as \term{projection onto a subspace}.
The following diagram should make it geometrically clear that, at least in Euclidean space, the solution is intimately related to orthogonality and the Pythagorean theorem:
\begin{center}
\includegraphics[width=0.5\linewidth]{orthogonal-projection}
\end{center}
Here $\y$ is an arbitrary element of the subspace $S$, and $\y^*$ is the point in $S$ such that $\x-\y^*$ is perpendicular to $S$.
The hypotenuse of a right triangle (in this case $\|\x-\y\|$) is always longer than either of the legs (in this case $\|\x-\y^*\|$ and $\|\y^*-\y\|$), and when $\y \neq \y^*$ there always exists such a triangle between $\x$, $\y$, and $\y^*$.
Our intuition from Euclidean space suggests that the closest point to $\x$ in $S$ has the perpendicularity property described above, and we now show that this is indeed the case.
\begin{proposition}
Suppose $\x \in V$ and $\y^* \in S$.
Then $\y^*$ is the unique minimizer of $\|\x-\y\|$ over $\y \in S$ if and only if $\x-\y^* \perp S$.
\end{proposition}
\begin{proof}
$(\implies)$
Suppose $\y^*$ is the unique minimizer of $\|\x-\y\|$ over $\y \in S$.
That is, $\|\x-\y^*\| \leq \|\x-\y\|$ for all $\y \in S$, with equality only if $\y = \y^*$.
Fix $\vec{v} \in S$ and observe that
\begin{align*}
g(t) &:= \|\x-\y^*+t\vec{v}\|^2 \\
&= \inner{\x-\y^*+t\vec{v}}{\x-\y^*+t\vec{v}} \\
&= \inner{\x-\y^*}{\x-\y^*} + 2t\inner{\x-\y^*}{\vec{v}} + t^2\inner{\vec{v}}{\vec{v}} \\
&= \|\x-\y^*\|^2 + 2t\inner{\x-\y^*}{\vec{v}} + t^2\|\vec{v}\|^2
\end{align*}
must have a minimum at $t = 0$ as a consequence of this assumption, since $\y^* - t\vec{v} \in S$ and therefore $g(t) = \|\x - (\y^* - t\vec{v})\|^2 \geq \|\x-\y^*\|^2 = g(0)$.
Thus
\[0 = g'(0) = \left.2\inner{\x-\y^*}{\vec{v}} + 2t\|\vec{v}\|^2\right|_{t=0} = 2\inner{\x-\y^*}{\vec{v}}\]
giving $\x-\y^* \perp \vec{v}$.
Since $\vec{v}$ was arbitrary in $S$, we have $\x-\y^* \perp S$ as claimed.
$(\impliedby)$
Suppose $\x-\y^* \perp S$.
Observe that for any $\y \in S$, $\y^*-\y \in S$ because $\y^* \in S$ and $S$ is closed under subtraction.
Under the hypothesis, $\x-\y^* \perp \y^*-\y$, so by the Pythagorean theorem,
\[\|\x-\y\|^2 = \|\x-\y^*+\y^*-\y\|^2 = \|\x-\y^*\|^2 + \|\y^*-\y\|^2 \geq \|\x - \y^*\|^2\]
and in fact the inequality is strict when $\y \neq \y^*$ since this implies $\|\y^*-\y\|^2 > 0$.
Thus $\y^*$ is the unique minimizer of $\|\x-\y\|$ over $\y \in S$.
\end{proof}
Since a unique minimizer in $S$ can be found for any $\x \in V$, we can define an operator
\[P\x = \argmin_{\y \in S} \|\x-\y\|\]
Observe that $P\y = \y$ for any $\y \in S$, since $\y$ has distance zero from itself and every other point in $S$ has positive distance from $\y$.
Thus $P(P\x) = P\x$ for any $\x$ (i.e., $P^2 = P$) because $P\x \in S$.
The identity $P^2 = P$ is actually one of the defining properties of a \term{projection}, the other being linearity.
An immediate consequence of the previous result is that $\x - P\x \perp S$ for any $\x \in V$, and conversely that $P$ is the unique operator that satisfies this property for all $\x \in V$.
For this reason, $P$ is known as an \term{orthogonal projection}.
If we choose an orthonormal basis for the target subspace $S$, it is possible to write down a more specific expression for $P$.
\begin{proposition}
If $\e_1, \dots, \e_m$ is an orthonormal basis for $S$, then
\[P\x = \sum_{i=1}^m \inner{\x}{\e_i}\e_i\]
\end{proposition}
\begin{proof}
Let $\e_1, \dots, \e_m$ be an orthonormal basis for $S$, and suppose $\x \in V$.
Then for all $j = 1, \dots, m$,
\begin{align*}
\biginner{\x-\sum_{i=1}^m \inner{\x}{\e_i}\e_i}{\e_j} &= \inner{\x}{\e_j} - \sum_{i=1}^m \inner{\x}{\e_i}\underbrace{\inner{\e_i}{\e_j}}_{\delta_{ij}} \\
&= \inner{\x}{\e_j} - \inner{\x}{\e_j} \\
&= 0
\end{align*}
We have shown that the claimed expression, call it $\tilde{P}\x$, satisfies $\x - \tilde{P}\x \perp \e_j$ for every element $\e_j$ of the orthonormal basis for $S$.
It follows (by linearity of the inner product) that $\x - \tilde{P}\x \perp S$, so the previous result implies $P = \tilde{P}$.
\end{proof}
The fact that $P$ is a linear operator (and thus a proper projection, as earlier we showed $P^2 = P$) follows readily from this result.
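As a small numerical sketch of this formula (assuming the Euclidean inner product and that $S$ is specified as the column space of a matrix), an orthonormal basis can be obtained from a reduced QR factorization and the sum above applied directly:
\begin{verbatim}
import numpy as np

def orthogonal_projection(A, x):
    """Project x onto the column space of A (Euclidean inner product).

    Assumes the columns of A are linearly independent.
    """
    Q, _ = np.linalg.qr(A)      # columns of Q: an orthonormal basis for col(A)
    return Q @ (Q.T @ x)        # sum_i <x, e_i> e_i

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])      # S = the x1-x2 plane in R^3
x = np.array([1.0, 2.0, 3.0])
print(orthogonal_projection(A, x))  # [1. 2. 0.]; x - Px is orthogonal to S
\end{verbatim}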
%Another useful fact about the orthogonal projection operator is that the metric it induces is \term{non-expansive}, i.e. $1$-Lipschitz.
%\begin{proposition}
%For any $\x \in V$,
%\[\|P\x\| \leq \|\x\|\]
%Thus for any $\x, \xye \in V$,
%\[\|P\x - P\xye\| \leq \|\x-\xye\|\]
%\end{proposition}
%\begin{proof}
%Suppose $\x \in V$.
%Then
%\[\|P\x\|^2 = \inner{P\x}{P\x} = \inner{\x}{P^2\x} = \inner{\x}{P\x} \leq \|\x\|\|P\x\|\]
%using respectively the self-adjointness of $P$, the fact that $P^2 = P$, and the Cauchy-Schwarz inequality.
%If $\|P\x\| = 0$, the inequality holds vacuously; otherwise we can divide both sides by $\|P\x\|$ to obtain $\|P\x\| \leq \|\x\|$.
%
%The second statement follows immediately from the first by linearity of $P$.
%\end{proof}