my-notes.qmd

---
title: "cs229 notes"
author: "Bhavit Sharma"
date: "today"
format:
  pdf:
    toc: true
    number-sections: true
    colorlinks: true
---

# Matrix Derivatives

Good link: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

## Some useful identities

1. 
$$\frac{\partial}{\partial X} \log |X| = X^{-1}$$

Proof: 
$$\frac{\partial}{\partial X} \log |X| = \frac{1}{|X|} \frac{\partial |X|}{\partial X}$$

We know that
$$
(\frac{\partial |X|}{\partial X})_{ij} = \frac{\partial}{\partial X_{ij}} * det(X)
$$

and
$$
det(X) = X_{i1} C_{i1} + X_{i2} C_{i2} + \dots + X_{in} C_{in}
$$
where $C_{ij}$ is the cofactor of $X_{ij}$.
So,
$$
\frac{\partial}{\partial X_{ij}} * det(X) = C_{ij}
$$
$$
\frac{\partial |X|}{\partial X} = C = adj(X)^T
$$
where $C$ is the cofactor matrix of $X$. $adj(X)$ is the adjugate matrix of $X$ and $X^{-1} = \frac{adjX}{|X|}$.

so we get
$$
\frac{\partial}{\partial X} \log |X| = \frac{1}{|X|} \frac{\partial |X|}{\partial X} = \frac{1}{|X|} adj(X)^T = ({X^{-1}})^T
$$

Reference: [kamper matrix calculus](https://www.kamperh.com/notes/kamper_matrixcalculus13.pdf)

2. $\frac{\partial}{\partial X} (z^TX^{-1}z) = -(X^{-1})zz^T(X^{-1})$

Proof:
$$
\frac{\partial}{\partial X} (z^TX^{-1}z)
$$

Lets first compute the derivative of $z^TX^{-1}z$ with respect to $X_{ij}$
$$
\frac{\partial}{\partial X_{ij}} (z^TX^{-1}z)
$$

Lets first derive $\frac{\partial X^{-1}}{\partial X_{ij}}$
$$
\frac{\partial X^{-1}}{\partial X_{ij}}
$$
Using $X * X^{-1} = I$ we get
$$
X^{-1}\frac{\partial X}{\partial X_{ij}} + \frac{\partial X^{-1}}{\partial X_{ij}}X = 0
$$ i.e.
$$
\frac{\partial X^{-1}}{\partial X_{ij}} = -X^{-1}\frac{\partial X}{\partial X_{ij}}X^{-1}
$$
where $\frac{\partial X}{\partial X_{ij}}$ is the matrix of partial derivatives of $X$ with respect to $X_{ij}$ and it's elements are $0$ except for the element at $i,j$ which is $1$.

So lets say $H = \frac{\partial\ tr(z^TX^{-1}z)}{\partial X}$
$$
H_{ij} = \frac{\partial}{\partial X_{ij}} tr(z^TX^{-1}z)
$$
Using cyclic property of trace we get
$$
H_{ij} = \frac{\partial}{\partial X_{ij}} tr(z^TX^{-1}z) = \frac{\partial}{\partial X_{ij}} tr(zz^T(X^{-1}))
$$

We know that
$$
\partial(Tr(A)) = Tr(\partial(A))
$$
because trace is linear.
so
$$
H_{ij} = tr(zz^T\frac{\partial}{\partial X_{ij}}(X^{-1})) = tr(zz^T(-X^{-1}\frac{\partial X}{\partial X_{ij}}X^{-1}))
$$
Using cyclic property of trace we get

$$
H_{ij} = tr(X^{-1}zz^TX^{-1}\frac{\partial X}{\partial X_{ij}})
$$

Now suppose that
$$
F = X^{-1}zz^TX^{-1}
$$
then
$$
tr(F\frac{\partial X}{\partial X_{ij}}) = F_{ji} = F_{ij}
$$
since $F$ is symmetric.
Hint: You can think of the fact only the $jth$ row of $F$ is multiplied by the $jth$
column, and only $ith$ column of $jth$ row of $F$ is multiplied by the $ith$ row of $jth$ column of $F$ leading to element at $F_{jj}$ contributing and the rest being zero.

Hence: $H = -X^{-1}zz^TX^{-1}$

# Prove that if $z^THz \geq 0$ then $H$ is positive semi-definite and cost function $J$ is convex. $H$ is Hessian matrix of $J$.

Some definitions first:

## Convex function

A function $f$ is convex if for any $x,y \in \mathbb{R}^n$ and $\alpha \in [0,1]$ we have
$$
f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)
$$

This basically means that any line segment between two points on the graph of the function lies above the graph of the function.


```{python}
#| echo: false

import numpy as np
import matplotlib.pyplot as plt


def f(x):
  return x**2


# plot the function
# Lets plot the function and show that the line segment between two points lies above the graph of the function

x = np.linspace(-10, 10, 100)
y = f(x)

plt.plot(x, y)
plt.scatter(-5, 25)
plt.scatter(4, 16)
plt.plot([-5, 4], [25, 16])

# Show a line x = 2.5 which intersects the graph of the function
plt.axvline(x=2.5, color='r')
plt.show()
```

## Convex functions properties

Property 1: If $f$ is convex then $f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)$ for any $\alpha \in [0,1]$ and $x,y \in \mathbb{R}^n$.

**Proof:** Let $x,y \in \mathbb{R}$ and $\alpha \in [0,1]$. Since any point between $x$ and $y$ on the line segment $[x,y]$ is given by $\alpha x + (1-\alpha)y$, we have the following from the definition of convexity:

The value of function as point $z = \alpha x + (1-\alpha)y$ is $f(\alpha x + (1-\alpha)y)$. Now the equation of line is:
$$
y = y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1)
$$
Plugging values for $z$ and $x_1, x_2, y_1, y_2$ we get:

$$
\begin{split}
y & = f(x) + \frac{f(y) - f(x)}{y - x}(\alpha.x + (1 - \alpha)y - x) \\
  &= f(x) + \frac{f(y) - f(x)}{y - x}(y - x)(1 - \alpha) \\
  &= f(x) (1 - 1 + \alpha) + f(y)(1 - \alpha) \\
  &= f(x) (\alpha) + f(y)(1 - \alpha) \\
\end{split}
$$

So according to the definition of convexity we have:
$$
f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)
$$

**Property 2**: Suppose $f: \mathcal{R}^n \rightarrow \mathcal{R}$. Then 

1. $f(y) \geq f(x) + \nabla f(x)^T(y-x)$ for all $x,y \in \mathcal{R}^n$ if and only if $f$ is convex

2. $\nabla^2 \succeq 0$ if and only if $f$ is convex

**Proof**:

```{python}
#| echo: false

import numpy
import matplotlib.pyplot as plt


# Plot a simple convex function
# Choose a point x and plot the tangent line at that point (x, f(x)) in red
# Now plot another point y
def f(x):
  return x**2


def derivate(x):
  return 2 * x


def tangent_line(x, y, x1):
  return derivate(x) * (x1 - x) + y


x = 2
x1 = np.linspace(-10, 10, 100)
y1 = f(x1)
plt.plot(x1, y1)
# Plot the point x
plt.scatter(x, f(x))
plt.scatter(-5, 25)

# Plot the tangent line at x
y = f(x)
x2 = np.linspace(x - 5, x + 5, 100)
plt.plot(x2, tangent_line(x, y, x2), color='r')

```

1. Using the definition of convexity we have:
$$
\begin{split}
f(\alpha x + (1-\alpha)y) &\leq \alpha f(x) + (1-\alpha)f(y) \\
f(y) - f(x) \geq \frac{f(x + \alpha(y - x)) - f(x)}{\alpha} \\
\text{if $\alpha \rightarrow$ 0} \\
f(y) - f(x) \geq \nabla f(x)^T(y - x) \\
\end{split}
$$ 

Now we also need to prove the other direction. So suppose $f(y) \geq f(x) + \nabla f(x)^T(y-x)$ for all $x,y \in \mathcal{R}^n$. We need to prove that $f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)$.
Let's consider $z = \alpha x + (1-\alpha)y$. Then we have:

$$
\begin{split}
f(x) &\geq f(z) + \nabla f(z)^T(x - z) \\
f(y) &\geq f(z) + \nabla f(z)^T(y - z) \\
\text{Multiply first with $\alpha$ and other by $1 - \alpha$ and add} \\
\alpha f(x) + (1 - \alpha)f(y) &\geq \alpha f(z) + (1 - \alpha)f(z) + \\ \alpha \nabla f(z)^T(x - z) + (1 - \alpha)\nabla f(z)^T(y - z) \\
  &\geq f(z) + \nabla f(z)^T(\alpha x + (1 - \alpha)y - z) \\
\text{Since $\alpha x + (1 - \alpha)y = z$ } \\
  &\geq f(z) + \nabla f(z)^T(0) \\
  &\geq f(z) \\
\end{split}
$$

2. Let's prove the second part. Suppose $\nabla^2 \succeq 0$ then we have:

Let us first prove it for $f: \mathbb{R} \rightarrow \mathbb{R}$.
Let $x, y \in dom(f)$ and $x \leq y$ then we have:
$$
\begin{split}
f(y) - f(x) &\geq f'(x)(y - x) \\
f(x) - f(y) &\geq f'(y)(x - y) \\
\implies \frac{f'(x) - f'(y)}{x - y} &\geq 0 \\
\text{if $x \rightarrow y$ then} \\
\frac{f'(x) - f'(y)}{x - y} \rightarrowtail f''(x) \geq 0 \\
\end{split}
$$

We can prove the other direction using the mean value version of the Taylor's theorem. Suppose $f''(x) \geq 0$ then there exists a point $z \in (x, y)$ such that:

$$
\begin{split}
f(y) = f(x) + f'(x)(y - x) + f''(z)\frac{(y - x)^2}{2} \\
\text{Since $f''(z) \geq 0$} \\
f(y) \geq f(x) + f'(x)(y - x) \\
\end{split}
$$

Now we need to prove the same for $f: \mathbb{R}^n \rightarrow \mathbb{R}$.
Remember that a convex function is convex along all lines.
i.e. if $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex then $g(\alpha) = f(x + \alpha(v))$ is convex for all $x \in \mathbb{R}^n$ and $v \in \mathbb{R}^n$.

$$
\begin{split}
g''(\alpha) = v^T\nabla^2 f(x + \alpha v) v \\
\end{split}
$$