-
Notifications
You must be signed in to change notification settings - Fork 0
/
my-notes.qmd
283 lines (219 loc) · 7.67 KB
/
my-notes.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
---
title: "cs229 notes"
author: "Bhavit Sharma"
date: "today"
format:
pdf:
toc: true
number-sections: true
colorlinks: true
---
# Matrix Derivatives
Good link: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
## Some useful identities
1.
$$\frac{\partial}{\partial X} \log |X| = X^{-1}$$
Proof:
$$\frac{\partial}{\partial X} \log |X| = \frac{1}{|X|} \frac{\partial |X|}{\partial X}$$
We know that
$$
(\frac{\partial |X|}{\partial X})_{ij} = \frac{\partial}{\partial X_{ij}} * det(X)
$$
and
$$
det(X) = X_{i1} C_{i1} + X_{i2} C_{i2} + \dots + X_{in} C_{in}
$$
where $C_{ij}$ is the cofactor of $X_{ij}$.
So,
$$
\frac{\partial}{\partial X_{ij}} * det(X) = C_{ij}
$$
$$
\frac{\partial |X|}{\partial X} = C = adj(X)^T
$$
where $C$ is the cofactor matrix of $X$. $adj(X)$ is the adjugate matrix of $X$ and $X^{-1} = \frac{adjX}{|X|}$.
so we get
$$
\frac{\partial}{\partial X} \log |X| = \frac{1}{|X|} \frac{\partial |X|}{\partial X} = \frac{1}{|X|} adj(X)^T = ({X^{-1}})^T
$$
Reference: [kamper matrix calculus](https://www.kamperh.com/notes/kamper_matrixcalculus13.pdf)
2. $\frac{\partial}{\partial X} (z^TX^{-1}z) = -(X^{-1})zz^T(X^{-1})$
Proof:
$$
\frac{\partial}{\partial X} (z^TX^{-1}z)
$$
Lets first compute the derivative of $z^TX^{-1}z$ with respect to $X_{ij}$
$$
\frac{\partial}{\partial X_{ij}} (z^TX^{-1}z)
$$
Lets first derive $\frac{\partial X^{-1}}{\partial X_{ij}}$
$$
\frac{\partial X^{-1}}{\partial X_{ij}}
$$
Using $X * X^{-1} = I$ we get
$$
X^{-1}\frac{\partial X}{\partial X_{ij}} + \frac{\partial X^{-1}}{\partial X_{ij}}X = 0
$$ i.e.
$$
\frac{\partial X^{-1}}{\partial X_{ij}} = -X^{-1}\frac{\partial X}{\partial X_{ij}}X^{-1}
$$
where $\frac{\partial X}{\partial X_{ij}}$ is the matrix of partial derivatives of $X$ with respect to $X_{ij}$ and it's elements are $0$ except for the element at $i,j$ which is $1$.
So lets say $H = \frac{\partial\ tr(z^TX^{-1}z)}{\partial X}$
$$
H_{ij} = \frac{\partial}{\partial X_{ij}} tr(z^TX^{-1}z)
$$
Using cyclic property of trace we get
$$
H_{ij} = \frac{\partial}{\partial X_{ij}} tr(z^TX^{-1}z) = \frac{\partial}{\partial X_{ij}} tr(zz^T(X^{-1}))
$$
We know that
$$
\partial(Tr(A)) = Tr(\partial(A))
$$
because trace is linear.
so
$$
H_{ij} = tr(zz^T\frac{\partial}{\partial X_{ij}}(X^{-1})) = tr(zz^T(-X^{-1}\frac{\partial X}{\partial X_{ij}}X^{-1}))
$$
Using cyclic property of trace we get
$$
H_{ij} = tr(X^{-1}zz^TX^{-1}\frac{\partial X}{\partial X_{ij}})
$$
Now suppose that
$$
F = X^{-1}zz^TX^{-1}
$$
then
$$
tr(F\frac{\partial X}{\partial X_{ij}}) = F_{ji} = F_{ij}
$$
since $F$ is symmetric.
Hint: You can think of the fact only the $jth$ row of $F$ is multiplied by the $jth$
column, and only $ith$ column of $jth$ row of $F$ is multiplied by the $ith$ row of $jth$ column of $F$ leading to element at $F_{jj}$ contributing and the rest being zero.
Hence: $H = -X^{-1}zz^TX^{-1}$
# Prove that if $z^THz \geq 0$ then $H$ is positive semi-definite and cost function $J$ is convex. $H$ is Hessian matrix of $J$.
Some definitions first:
## Convex function
A function $f$ is convex if for any $x,y \in \mathbb{R}^n$ and $\alpha \in [0,1]$ we have
$$
f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)
$$
This basically means that any line segment between two points on the graph of the function lies above the graph of the function.
```{python}
#| echo: false
import numpy as np
import matplotlib.pyplot as plt
def f(x):
return x**2
# plot the function
# Lets plot the function and show that the line segment between two points lies above the graph of the function
x = np.linspace(-10, 10, 100)
y = f(x)
plt.plot(x, y)
plt.scatter(-5, 25)
plt.scatter(4, 16)
plt.plot([-5, 4], [25, 16])
# Show a line x = 2.5 which intersects the graph of the function
plt.axvline(x=2.5, color='r')
plt.show()
```
## Convex functions properties
Property 1: If $f$ is convex then $f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)$ for any $\alpha \in [0,1]$ and $x,y \in \mathbb{R}^n$.
**Proof:** Let $x,y \in \mathbb{R}$ and $\alpha \in [0,1]$. Since any point between $x$ and $y$ on the line segment $[x,y]$ is given by $\alpha x + (1-\alpha)y$, we have the following from the definition of convexity:
The value of function as point $z = \alpha x + (1-\alpha)y$ is $f(\alpha x + (1-\alpha)y)$. Now the equation of line is:
$$
y = y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1)
$$
Plugging values for $z$ and $x_1, x_2, y_1, y_2$ we get:
$$
\begin{split}
y & = f(x) + \frac{f(y) - f(x)}{y - x}(\alpha.x + (1 - \alpha)y - x) \\
&= f(x) + \frac{f(y) - f(x)}{y - x}(y - x)(1 - \alpha) \\
&= f(x) (1 - 1 + \alpha) + f(y)(1 - \alpha) \\
&= f(x) (\alpha) + f(y)(1 - \alpha) \\
\end{split}
$$
So according to the definition of convexity we have:
$$
f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)
$$
**Property 2**: Suppose $f: \mathcal{R}^n \rightarrow \mathcal{R}$. Then
1. $f(y) \geq f(x) + \nabla f(x)^T(y-x)$ for all $x,y \in \mathcal{R}^n$ if and only if $f$ is convex
2. $\nabla^2 \succeq 0$ if and only if $f$ is convex
**Proof**:
```{python}
#| echo: false
import numpy
import matplotlib.pyplot as plt
# Plot a simple convex function
# Choose a point x and plot the tangent line at that point (x, f(x)) in red
# Now plot another point y
def f(x):
return x**2
def derivate(x):
return 2 * x
def tangent_line(x, y, x1):
return derivate(x) * (x1 - x) + y
x = 2
x1 = np.linspace(-10, 10, 100)
y1 = f(x1)
plt.plot(x1, y1)
# Plot the point x
plt.scatter(x, f(x))
plt.scatter(-5, 25)
# Plot the tangent line at x
y = f(x)
x2 = np.linspace(x - 5, x + 5, 100)
plt.plot(x2, tangent_line(x, y, x2), color='r')
```
1. Using the definition of convexity we have:
$$
\begin{split}
f(\alpha x + (1-\alpha)y) &\leq \alpha f(x) + (1-\alpha)f(y) \\
f(y) - f(x) \geq \frac{f(x + \alpha(y - x)) - f(x)}{\alpha} \\
\text{if $\alpha \rightarrow$ 0} \\
f(y) - f(x) \geq \nabla f(x)^T(y - x) \\
\end{split}
$$
Now we also need to prove the other direction. So suppose $f(y) \geq f(x) + \nabla f(x)^T(y-x)$ for all $x,y \in \mathcal{R}^n$. We need to prove that $f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)$.
Let's consider $z = \alpha x + (1-\alpha)y$. Then we have:
$$
\begin{split}
f(x) &\geq f(z) + \nabla f(z)^T(x - z) \\
f(y) &\geq f(z) + \nabla f(z)^T(y - z) \\
\text{Multiply first with $\alpha$ and other by $1 - \alpha$ and add} \\
\alpha f(x) + (1 - \alpha)f(y) &\geq \alpha f(z) + (1 - \alpha)f(z) + \\ \alpha \nabla f(z)^T(x - z) + (1 - \alpha)\nabla f(z)^T(y - z) \\
&\geq f(z) + \nabla f(z)^T(\alpha x + (1 - \alpha)y - z) \\
\text{Since $\alpha x + (1 - \alpha)y = z$ } \\
&\geq f(z) + \nabla f(z)^T(0) \\
&\geq f(z) \\
\end{split}
$$
2. Let's prove the second part. Suppose $\nabla^2 \succeq 0$ then we have:
Let us first prove it for $f: \mathbb{R} \rightarrow \mathbb{R}$.
Let $x, y \in dom(f)$ and $x \leq y$ then we have:
$$
\begin{split}
f(y) - f(x) &\geq f'(x)(y - x) \\
f(x) - f(y) &\geq f'(y)(x - y) \\
\implies \frac{f'(x) - f'(y)}{x - y} &\geq 0 \\
\text{if $x \rightarrow y$ then} \\
\frac{f'(x) - f'(y)}{x - y} \rightarrowtail f''(x) \geq 0 \\
\end{split}
$$
We can prove the other direction using the mean value version of the Taylor's theorem. Suppose $f''(x) \geq 0$ then there exists a point $z \in (x, y)$ such that:
$$
\begin{split}
f(y) = f(x) + f'(x)(y - x) + f''(z)\frac{(y - x)^2}{2} \\
\text{Since $f''(z) \geq 0$} \\
f(y) \geq f(x) + f'(x)(y - x) \\
\end{split}
$$
Now we need to prove the same for $f: \mathbb{R}^n \rightarrow \mathbb{R}$.
Remember that a convex function is convex along all lines.
i.e. if $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex then $g(\alpha) = f(x + \alpha(v))$ is convex for all $x \in \mathbb{R}^n$ and $v \in \mathbb{R}^n$.
$$
\begin{split}
g''(\alpha) = v^T\nabla^2 f(x + \alpha v) v \\
\end{split}
$$