
Commit

add figures
Dany-L committed Feb 28, 2023
1 parent 679bbc6 commit d2c8ad1
Showing 6 changed files with 64 additions and 113 deletions.
2 changes: 1 addition & 1 deletion _config.yml
@@ -99,7 +99,7 @@ author:
flickr :
facebook :
foursquare :
github : https://github.com/Dany-L
github : Dany-L
google_plus :
keybase :
instagram :
4 changes: 0 additions & 4 deletions _includes/head/custom.html
@@ -33,8 +33,4 @@
</script>
<script src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/latest.js?config=TeX-MML-AM_CHTML' async></script>

<!-- Add TikZ support -->
<link rel="stylesheet" type="text/css" href="https://tikzjax.com/v1/fonts.css">
<script src="https://tikzjax.com/v1/tikzjax.js"></script>

<!-- end custom head snippets -->
171 changes: 63 additions & 108 deletions _posts/2022-12-22-deq_for_sysid.md
@@ -6,13 +6,10 @@ tags:
- system identification
- equilibrium models
---
Deep equilibrium networks and their relation to system theory, part of the seminar *Machine Learning in the Sciences by [Mathias Niepert](http://www.matlog.net)*.

<!-- The code for the examples shown is available on [GitHub](https://github.com/Dany-L/RenForSysId) -->
Deep equilibrium networks and their relation to system theory, part of the seminar *Machine Learning in the Sciences* by [Mathias Niepert](http://www.matlog.net). The code for the examples shown is available on [GitHub](https://github.com/Dany-L/RenForSysId).

# Motivation
Equilibrium networks were introduced at [NeurIPS 2019](https://proceedings.neurips.cc/paper/2019/hash/01386bd6d8e091c2ab4c7c7de644d37b-Abstract.html), with their main benefit being memory efficiency: compared to state-of-the-art networks, deep equilibrium networks can reach the same level of accuracy without storing the output of each layer for backpropagation. The goal of this post is to stress the connection between deep equilibrium networks and system theory and to show how they can be applied to system identification and control. This link is also drawn in a [CDC 2022](https://ieeexplore.ieee.org/abstract/document/9992684/) and a [CDC 2021](https://ieeexplore.ieee.org/abstract/document/9683054/) paper.
TODO: add references

To appreciate that connection, let us assume an unknown nonlinear dynamical system that can be described by the discrete-time difference equation

@@ -26,7 +23,7 @@
$$
\begin{equation}
\begin{aligned}
x^{k+1} & = f_{\text{true}}(x^k, u^k) \\
y^k & = g_{\text{true}}(x^k, u^k)
\end{aligned}
\label{eq:nl_system}
\end{equation}
$$

with given initial condition $x^0$. The state is denoted by $x^k$, the input by $u^k$ and the output by $y^k$, the superscript indicates the time step of the sequence $k=1, \ldots, N$. The goal in system identification is to learn the functions $g_{\text{true}}: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \mapsto \mathbb{R}^{n_y}$ and $f_{\text{true}}: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \mapsto \mathbb{R}^{n_x}$ from a set of input-output measurements $\mathcal{D} = \lbrace (u, y)_i \rbrace_{i=1}^K$.
with given initial condition $x^0$. The state is denoted by $x^k$, the input by $u^k$ and the output by $y^k$; the superscript indicates the time step of the sequence, $k=1, \ldots, N$. The goal in system identification is to learn the functions $g_{\text{true}}: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \mapsto \mathbb{R}^{n_y}$ and $f_{\text{true}}: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \mapsto \mathbb{R}^{n_x}$ from a set of input-output measurements $\mathcal{D} = \left\lbrace (u, y)_i \right\rbrace_{i=1}^K$.
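
As a minimal sketch of this setup, the following generates such a dataset $\mathcal{D}$ by simulating an assumed toy system; the concrete choice of $f_{\text{true}}$ and $g_{\text{true}}$ below is purely illustrative and not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_u, n_y, N, K = 2, 1, 1, 50, 10

# illustrative stand-ins for the unknown true dynamics and output map
f_true = lambda x, u: np.tanh(0.8 * x + 0.5 * u)        # R^{n_x} x R^{n_u} -> R^{n_x}
g_true = lambda x, u: np.array([x.sum() + 0.1 * u[0]])  # R^{n_x} x R^{n_u} -> R^{n_y}

D = []  # dataset of K input-output sequences
for _ in range(K):
    x = np.zeros(n_x)                      # initial condition x^0
    u_seq = rng.standard_normal((N, n_u))  # excitation signal
    y_seq = np.zeros((N, n_y))
    for k in range(N):
        y_seq[k] = g_true(x, u_seq[k])     # measured output y^k
        x = f_true(x, u_seq[k])            # state update x^{k+1}
    D.append((u_seq, y_seq))
```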

The system \eqref{eq:nl_system} maps an input sequence $u$ to an output sequence $y$, and recurrent neural networks are a natural fit for modeling such sequence-to-sequence maps. From a system-theoretic perspective, a recurrent neural network is a discrete-time, linear, time-invariant system interconnected with a static nonlinearity, the activation function. A very general formulation therefore follows as

@@ -51,7 +48,7 @@
\end{equation}
$$

with $w^k = \Delta(z^k)$, the standard recurrent neural network results as a special case of this more general description, this can be seen by choosing the hidden state $h^{k} = x^{k+1}$, $\Delta(z^k) = \tanh(z^k)$ and the following parameters:
with $w^k = \Delta(z^k)$. The standard recurrent neural network (see [Equation 10](https://www.deeplearningbook.org/contents/rnn.html)) results as a special case of this more general description; this can be seen by choosing the hidden state $h^{k} = x^{k+1}$, $\Delta(z^k) = \tanh(z^k)$ and the following parameters:

$$
\begin{equation*}
@@ -84,119 +81,77 @@
The focus of this post is to highlight the link between deep equilibrium networks and their application to problems in systems and control. Details on how to calculate the gradient and on monotone operator theory are only referenced.

# Deep equilibrium networks
Consider a input sequence $u$ that is fed through a neural network with $L$ layers, on each layer $f_{\theta}^{0}(x^0, u), \ldots, f_{\theta}^{L-1}(x^{L-1}, u)$, where $x$ represents the hidden state and $f_{\theta}^i$ the activation function on each layer, the network is shown in Figure
Consider an input sequence $u$ that is fed through a neural network with $L$ layers $f_{\theta}^{[0]}(x^0, u), \ldots, f_{\theta}^{[L-1]}(x^{L-1}, u)$, where $x$ represents the hidden state and $f_{\theta}^{[i]}$ the activation function of layer $i$.

![Deep forward model](/images/ren/fwd_deep.png)

The first step towards deep equilibrium networks is to tie the weights, $f_{\theta}^{[0]}(x, u) = f_{\theta}^{[i]}(x, u)$ for all $i=0, \ldots, L-1$. It turns out that this restriction does not hurt the prediction accuracy of the network, since any deep neural network can be replaced by a single weight-tied layer by increasing the size of the weights (see [Appendix C](https://proceedings.neurips.cc/paper/2019/hash/01386bd6d8e091c2ab4c7c7de644d37b-Abstract.html) for details).

![Weight tied network](/images/ren/fwd_tied.png)

In a next step the number of layers is increased, $L \to \infty$. The forward pass can now also be formulated as finding a fixed point $x^*$, which can be computed by a number of root-finding algorithms, as illustrated next.

<script type="text/tikz">
\begin{tikzpicture}[align=center]
\draw (0,0) circle (1in);
\end{tikzpicture}
</script>
![Deep equilibrium model](/images/ren/fwd_deq.png)
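
To make the fixed-point view concrete, here is a minimal numpy sketch of the weight-tied forward pass as an iteration that converges to $x^*$; the weights, their scaling, and the tolerance are assumptions chosen so that the iteration contracts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_u = 10, 1
W = 0.9 * rng.standard_normal((n_x, n_x)) / n_x  # scaled down so the map contracts
U = rng.standard_normal((n_x, n_u))
b = rng.standard_normal((n_x, 1))
u = rng.standard_normal((n_u, 1))

# weight-tied layer f_theta(x, u); letting L -> infinity amounts to
# finding an equilibrium x* = f_theta(x*, u)
f_theta = lambda x, u: np.tanh(W @ x + U @ u + b)

x = np.zeros((n_x, 1))
for _ in range(1000):          # naive fixed-point (Picard) iteration
    x_next = f_theta(x, u)
    if np.linalg.norm(x_next - x) < 1e-10:
        break
    x = x_next
x_star = x_next                # approximate equilibrium x*
```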

test python code block
## Backward pass
To train the deep equilibrium network, the gradient with respect to the parameters $\theta$ needs to be calculated from the forward pass. Traditionally this is achieved by backpropagating through the layers of the forward pass of a deep neural network. For deep equilibrium models, however, this is not desired, since the gradient should be independent of the root-finding algorithm.

The loss function follows as

$$
\ell=\mathcal{L}\left(h\left(\operatorname{RootFind}\left(g_\theta ; u\right)\right), y\right),
$$
with the output layer $h:\mathbb{R}^{n_z} \mapsto \mathbb{R}^{n_y}$, which can be any differentiable function (e.g. linear), the ground-truth sequence $y$, and the loss function $\mathcal{L}:\mathbb{R}^{n_y}\times\mathbb{R}^{n_y} \mapsto \mathbb{R}$.

The gradient with respect to $(\cdot)$ (e.g. $\theta$) can now be calculated by implicit differentiation
$$
\frac{\partial \ell}{\partial(\cdot)}=-\frac{\partial \ell}{\partial h} \frac{\partial h}{\partial x^{\star}}\left(\left.J_{g_\theta}^{-1}\right|_{x^*}\right) \frac{\partial f_\theta\left(x^{\star} ; u\right)}{\partial(\cdot)},
$$
where $\left.J_{g_\theta}^{-1}\right|_{x^*}$ is the inverse Jacobian of $g_{\theta}$ evaluated at $x^*$.

For details on the gradient and how it can be calculated, see [Chapter 4](http://implicit-layers-tutorial.org/deep_equilibrium_models/) of the implicit layers tutorial.
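
As a rough sketch of how this implicit gradient can be evaluated, the toy example below differentiates the loss with respect to the bias $b$ of an assumed weight-tied layer $f_\theta(x; u) = \tanh(Wx + Uu + b)$ and checks the result against a finite difference; all names and the quadratic loss are assumptions, not the notation of the referenced tutorial.

```python
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(0)
n_x, n_u, n_y = 10, 1, 1
W = 0.9 * rng.standard_normal((n_x, n_x)) / n_x  # scaled so a unique equilibrium exists
U = rng.standard_normal((n_x, n_u))
b = rng.standard_normal(n_x)
W_y = rng.standard_normal((n_y, n_x))
u = rng.standard_normal(n_u)
y = rng.standard_normal(n_y)

f = lambda x, b_: np.tanh(W @ x + U @ u + b_)  # weight-tied layer f_theta(x; u)
g = lambda x, b_: f(x, b_) - x                 # equilibrium condition g_theta(x*) = 0

def loss(b_):
    x_star = fsolve(lambda x: g(x, b_), np.zeros(n_x))  # forward pass
    return 0.5 * np.sum((W_y @ x_star - y) ** 2), x_star

_, x_star = loss(b)
r = W_y @ x_star - y                            # dl/dh for the quadratic loss
s = 1.0 - np.tanh(W @ x_star + U @ u + b) ** 2  # tanh'(.) at the equilibrium
J_g = np.diag(s) @ W - np.eye(n_x)              # Jacobian of g_theta at x*
df_db = np.diag(s)                              # df_theta/db at x*

# dl/db = - dl/dh * dh/dx* * J_g^{-1}|_{x*} * df_theta/db
grad_b = -(r @ W_y) @ np.linalg.inv(J_g) @ df_db

# finite-difference check on the first component of b
eps, e0 = 1e-5, np.eye(n_x)[0]
fd = (loss(b + eps * e0)[0] - loss(b - eps * e0)[0]) / (2 * eps)
print(grad_b[0], fd)  # the two values should agree closely
```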

## Example
Let's make a simple example to compare a fixed-layer neural network with a deep equilibrium model. We assume sequence length $T=3$, hidden state size $n_x = 10$, and input and output size $n_y = n_u = 1$. The weights are randomly initialized, the initial hidden state is set to zero, $x^0 = 0$, with $W_x \in \mathbb{R}^{n_x \times n_x}$, $U_x\in \mathbb{R}^{n_x \times T}$, and we take a linear output layer with $W_y \in \mathbb{R}^{n_y \times n_x}$; the biases are sized accordingly.
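
A possible initialization for this setup is sketched below; treating $W_x$, $U_x$, $W_y$ as bias-free linear layers with separate bias tensors, taking the input map per sample of size $n_u$, and the extra scaling of $W_x$ are assumptions made only so that the snippets that follow have all names defined.

```python
import numpy as np
import torch

torch.manual_seed(0)
T, n_x, n_u, n_y, L = 3, 10, 1, 1, 30

W_x = torch.nn.Linear(n_x, n_x, bias=False)  # state-to-state map
U_x = torch.nn.Linear(n_u, n_x, bias=False)  # input-to-state map
W_y = torch.nn.Linear(n_x, n_y, bias=False)  # linear output layer
W_x.weight.data *= 0.5                       # scale down so the iteration contracts
b_x = torch.zeros(1, n_x)
b_y = torch.zeros(1, n_y)
nl = torch.tanh                              # activation / static nonlinearity

u = np.random.rand(n_u).astype(np.float32)   # random input sample
x_0 = np.zeros(n_x)                          # initial guess for the root finder

# numpy copies of the weights for the equilibrium model below
W_x_numpy = W_x.weight.detach().numpy()
U_x_numpy = U_x.weight.detach().numpy()
W_y_numpy = W_y.weight.detach().numpy()
b_x_numpy = b_x.numpy().reshape(n_x, 1)
b_y_numpy = b_y.numpy().reshape(n_y, 1)
```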

The forward pass of the $L$-layer sequence-to-sequence model in PyTorch:
```python
# forward pass for fixed number of layers
x = torch.zeros(size=(1, n_x))       # initial hidden state x^0 = 0
u = torch.tensor(u).reshape(1, n_u)  # input sample
for l in range(L):
    x = nl(W_x(x) + U_x(u) + b_x)    # weight-tied layer update
y_hat = W_y(x) + b_y                 # linear output layer
```
The forward pass for the deep equilibrium model:
```python
# DEQ: the forward pass becomes a root-finding problem for the equilibrium x*
def g_theta(x):
    # u is assumed to be the numpy input of shape (n_u,)
    x = x.reshape(n_x, 1)
    # residual of the equilibrium condition x = tanh(W_x x + U_x u + b_x)
    return np.squeeze(np.tanh(W_x_numpy @ x + U_x_numpy @ u.reshape(n_u, 1) + b_x_numpy) - x)

x_star, infodict, ier, mesg = fsolve(g_theta, x0=x_0, full_output=True)
x_star = x_star.reshape(n_x, 1)
y_hat_eq = W_y_numpy @ x_star + b_y_numpy
```
Note that these are only small snippets that should give an idea of how to implement the models; the code is not supposed to run without further adjustment. For the root-finding algorithm, [scipy.optimize.fsolve](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fsolve.html) is used.

<script type="text/tikz">
\begin{tikzpicture}[
node distance = 0.25cm and 0.5cm,
auto,
align=center,
block/.style={
draw,
rectangle,
rounded corners,
minimum height=2em,
minimum width=2em
}
]
% blocks
\node[] (input) {};
\node[block, right= of input] (G) {$G$};
\end{tikzpicture}
</script>

<!-- <script type="text/tikz">
\begin{tikzpicture}[
node distance = 0.25cm and 0.5cm,
auto,
align=center,
block/.style={
draw,
rectangle,
rounded corners,
minimum height=2em,
minimum width=2em
}
]
% blocks
\node[] (input) {};
\node[block, right= of input] (G) {
\begin{tikzpicture}[
node distance = 0.25cm and 0.5cm,
auto,
align=center,
block/.style={
draw,
rectangle,
rounded corners,
minimum height=2em,
minimum width=2em
}
]
\node[] (inL1) {};
\node[block, right= of inL1] (L1) {$f_{\theta}^{[0]}(z_{1:T}^0; x_{1:T})$};
\node[right= of L1] (outL1) {};
\node[above= of L1] (inX) {};
\node[right= of outL1] (dots) {$\cdots$};
\node[right= of dots] (inLL) {};
\node[block, right= of inLL] (LL) {$f_{\theta}^{[L-1]}(z_{1:T}^{L-1}; x_{1:T})$};
\node[right= of LL] (outLL) {};
\node[above= of LL] (inXL) {};
% Input and outputs coordinates
% lines
\draw[->] (inX) node[right] {$x_{1:T}$} -- (L1.north);
\draw[->] (inL1) node[above] {$z_{1:T}^0$} -- (L1);
\draw[->] (L1) -- (outL1) node[above] {$z^1_{1:T}$};
\draw[->] (inXL) node[right] {$x_{1:T}$} -- (LL.north);
\draw[->] (inLL) node[above] {$z_{1:T}^{L-1}$} -- (LL);
\draw[->] (LL) -- (outLL) node[above] {$z_{1:T}^L$};
\end{tikzpicture}
};
\node at (G.north) [above] {$\mathcal{S}_{\operatorname{DEQ}}$};
\node[right= of G] (output) {};
% Input and outputs coordinates
% lines
\draw[->] (input) node[above] {$x_{1:T}, z_{1:T}^0$} -- (G);
\draw[->] (G) -- (output) node[above] {$z_{1:T}^L$} ;
\end{tikzpicture}
</script> -->

TODO: add figure.

Note that such a network matches the system \eqref{eq:nl_system}.

The first step towards deep equilibrium networks is to tie the weights $f_{\theta}^{0}(x^0, u) = $f_{\theta}^{i}(x^0, u)$ for all $i=0, \ldots, L-1$. It turns out that this restriction does not hurt the prediction accuracy of the network, since any deep neural network can be replaced by a single layer by increasing the size of the weight (See TODO for details).

The weight tied network is shown in Figure TODO.

In a next step the number of layer is increased $L \to \infty$. The forward pass can now also be formulated as finding a fixed point $z^*$, which can be solved by a number of root fining algorithm as illustrated in Figure TODO
The results for different values of $L$ are compared:
```
Number of finite layers: 0 || x^L - x^* ||^2: 0.7032
Number of finite layers: 1 || x^L - x^* ||^2: 0.3898
Number of finite layers: 2 || x^L - x^* ||^2: 0.2898
Number of finite layers: 3 || x^L - x^* ||^2: 0.1621
Number of finite layers: 4 || x^L - x^* ||^2: 0.09451
Number of finite layers: 10 || x^L - x^* ||^2: 0.001685
Number of finite layers: 20 || x^L - x^* ||^2: 7.595e-06
Number of finite layers: 30 || x^L - x^* ||^2: 7.069e-08
```
The result shows that the weight-tied feed-forward network converges to the same output as the equilibrium network as the number of layers increases.
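
A minimal sketch of how this comparison could be produced, reusing the numpy weights and the equilibrium $x^*$ from the snippets above (the loop itself is an assumption, not the original benchmarking code):

```python
# compare the finite-depth, weight-tied forward pass against the equilibrium x*
for L in [0, 1, 2, 3, 4, 10, 20, 30]:
    x = np.zeros((n_x, 1))
    for _ in range(L):
        x = np.tanh(W_x_numpy @ x + U_x_numpy @ u.reshape(n_u, 1) + b_x_numpy)
    err = float(np.sum((x - x_star) ** 2))
    print(f"Number of finite layers: {L} || x^L - x^* ||^2: {err:.4}")
```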

# Monotone operator equilibrium networks


# System identification with equilibrium networks


Binary file added images/ren/fwd_deep.png
Binary file added images/ren/fwd_deq.png
Binary file added images/ren/fwd_tied.png
