
Commit

add figures
Dany-L committed Feb 28, 2023
1 parent 679bbc6 commit d2c8ad1
Showing 6 changed files with 64 additions and 113 deletions.
2 changes: 1 addition & 1 deletion _config.yml
@@ -99,7 +99,7 @@ author:
flickr :
facebook :
foursquare :
github : https://github.com/Dany-L
github : Dany-L
google_plus :
keybase :
instagram :
4 changes: 0 additions & 4 deletions _includes/head/custom.html
@@ -33,8 +33,4 @@
</script>
<script src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/latest.js?config=TeX-MML-AM_CHTML' async></script>

<!-- Add TikZ support -->
<link rel="stylesheet" type="text/css" href="https://tikzjax.com/v1/fonts.css">
<script src="https://tikzjax.com/v1/tikzjax.js"></script>

<!-- end custom head snippets -->
171 changes: 63 additions & 108 deletions _posts/2022-12-22-deq_for_sysid.md
@@ -6,13 +6,10 @@ tags:
- system identification
- equilibrium models
---
Deep equilibrium networks and their relation to system theory, part of the seminar *Machine Learning in the Sciences by [Mathias Niepert](http://www.matlog.net)*.

<!-- The code for the examples shown is available on [GitHub](https://github.com/Dany-L/RenForSysId) -->
Deep equilibrium networks and their relation to system theory, part of the seminar *Machine Learning in the Sciences* by [Mathias Niepert](http://www.matlog.net). The code for the examples shown is available on [GitHub](https://github.com/Dany-L/RenForSysId).

# Motivation
Equilibrium networks were introduced at [NeurIPS 2019](https://proceedings.neurips.cc/paper/2019/hash/01386bd6d8e091c2ab4c7c7de644d37b-Abstract.html), with their main benefit being memory efficiency: compared to state-of-the-art networks, deep equilibrium networks can reach the same level of accuracy without storing the output of each layer for backpropagation. The goal of this post is to stress the connection between deep equilibrium networks and system theory and to show how they can be applied to system identification and control. This link is also drawn in a [CDC 2022](https://ieeexplore.ieee.org/abstract/document/9992684/) and a [CDC 2021](https://ieeexplore.ieee.org/abstract/document/9683054/) paper.
TODO: add references

To appreciate that connection, let us assume an unknown nonlinear dynamical system that can be described by the discrete-time difference equation

@@ -26,7 +23,7 @@
$$
\begin{equation}
\begin{aligned}
x^{k+1} & = f_{\text{true}}(x^k, u^k) \\
y^k & = g_{\text{true}}(x^k, u^k)
\end{aligned}
\label{eq:nl_system}
\end{equation}
$$

with given initial condition $x^0$. The state is denoted by $x^k$, the input by $u^k$ and the output by $y^k$, the superscript indicates the time step of the sequence $k=1, \ldots, N$. The goal in system identification is to learn the functions $g_{\text{true}}: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \mapsto \mathbb{R}^{n_y}$ and $f_{\text{true}}: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \mapsto \mathbb{R}^{n_x}$ from a set of input-output measurements $\mathcal{D} = \lbrace (u, y)_i \rbrace_{i=1}^K$.
with given initial condition $x^0$. The state is denoted by $x^k$, the input by $u^k$ and the output by $y^k$; the superscript indicates the time step of the sequence, $k=1, \ldots, N$. The goal in system identification is to learn the functions $g_{\text{true}}: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \mapsto \mathbb{R}^{n_y}$ and $f_{\text{true}}: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \mapsto \mathbb{R}^{n_x}$ from a set of input-output measurements $\mathcal{D} = \left\lbrace (u, y)_i \right\rbrace_{i=1}^K$.
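
As a minimal sketch of this setup, the following generates such a dataset $\mathcal{D}$ by simulating an assumed toy system; the concrete choice of $f_{\text{true}}$ and $g_{\text{true}}$ below is purely illustrative and not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_u, n_y, N, K = 2, 1, 1, 50, 10

# illustrative stand-ins for the unknown true dynamics and output map
f_true = lambda x, u: np.tanh(0.8 * x + 0.5 * u)        # R^{n_x} x R^{n_u} -> R^{n_x}
g_true = lambda x, u: np.array([x.sum() + 0.1 * u[0]])  # R^{n_x} x R^{n_u} -> R^{n_y}

D = []  # dataset of K input-output sequences
for _ in range(K):
    x = np.zeros(n_x)                      # initial condition x^0
    u_seq = rng.standard_normal((N, n_u))  # excitation signal
    y_seq = np.zeros((N, n_y))
    for k in range(N):
        y_seq[k] = g_true(x, u_seq[k])     # measured output y^k
        x = f_true(x, u_seq[k])            # state update x^{k+1}
    D.append((u_seq, y_seq))
```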

The system \eqref{eq:nl_system} maps an input sequence $u$ to an output sequence $y$, and recurrent neural networks are a natural fit for modeling such sequence-to-sequence maps. From a system-theoretic perspective, a recurrent neural network is a discrete-time, linear, time-invariant system interconnected with a static nonlinearity, the activation function. A very general formulation therefore follows as

@@ -51,7 +48,7 @@
\end{equation}
$$

with $w^k = \Delta(z^k)$, the standard recurrent neural network results as a special case of this more general description, this can be seen by choosing the hidden state $h^{k} = x^{k+1}$, $\Delta(z^k) = \tanh(z^k)$ and the following parameters:
with $w^k = \Delta(z^k)$. The standard recurrent neural network (see [Equation 10](https://www.deeplearningbook.org/contents/rnn.html)) results as a special case of this more general description; this can be seen by choosing the hidden state $h^{k} = x^{k+1}$, $\Delta(z^k) = \tanh(z^k)$ and the following parameters:

$$
\begin{equation*}
@@ -84,119 +81,77 @@
The focus of this post is to highlight the link between deep equilibrium networks and their application to problems in systems and control. Details on how to calculate the gradient and on monotone operator theory are only referenced.

# Deep equilibrium networks
Consider a input sequence $u$ that is fed through a neural network with $L$ layers, on each layer $f_{\theta}^{0}(x^0, u), \ldots, f_{\theta}^{L-1}(x^{L-1}, u)$, where $x$ represents the hidden state and $f_{\theta}^i$ the activation function on each layer, the network is shown in Figure
Consider an input sequence $u$ that is fed through a neural network with $L$ layers $f_{\theta}^{[0]}(x^0, u), \ldots, f_{\theta}^{[L-1]}(x^{L-1}, u)$, where $x$ represents the hidden state and $f_{\theta}^{[i]}$ the activation function of layer $i$.

![Deep forward model](/images/ren/fwd_deep.png)

The first step towards deep equilibrium networks is to tie the weights, $f_{\theta}^{[0]}(x, u) = f_{\theta}^{[i]}(x, u)$ for all $i=0, \ldots, L-1$. It turns out that this restriction does not hurt the prediction accuracy of the network, since any deep neural network can be replaced by a single weight-tied layer by increasing the size of the weights (see [Appendix C](https://proceedings.neurips.cc/paper/2019/hash/01386bd6d8e091c2ab4c7c7de644d37b-Abstract.html) for details).

![Weight tied network](/images/ren/fwd_tied.png)

In a next step the number of layers is increased, $L \to \infty$. The forward pass can now also be formulated as finding a fixed point $x^*$, which can be computed by a number of root-finding algorithms, as illustrated next.

<script type="text/tikz">
\begin{tikzpicture}[align=center]
\draw (0,0) circle (1in);
\end{tikzpicture}
</script>
![Deep equilibrium model](/images/ren/fwd_deq.png)
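
To make the fixed-point view concrete, here is a minimal numpy sketch of the weight-tied forward pass as an iteration that converges to $x^*$; the weights, their scaling, and the tolerance are assumptions chosen so that the iteration contracts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_u = 10, 1
W = 0.9 * rng.standard_normal((n_x, n_x)) / n_x  # scaled down so the map contracts
U = rng.standard_normal((n_x, n_u))
b = rng.standard_normal((n_x, 1))
u = rng.standard_normal((n_u, 1))

# weight-tied layer f_theta(x, u); letting L -> infinity amounts to
# finding an equilibrium x* = f_theta(x*, u)
f_theta = lambda x, u: np.tanh(W @ x + U @ u + b)

x = np.zeros((n_x, 1))
for _ in range(1000):          # naive fixed-point (Picard) iteration
    x_next = f_theta(x, u)
    if np.linalg.norm(x_next - x) < 1e-10:
        break
    x = x_next
x_star = x_next                # approximate equilibrium x*
```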

test python code block
## Backward pass
To train the deep equilibrium network, the gradient with respect to the parameters $\theta$ needs to be calculated from the forward pass. Traditionally this is achieved by backpropagating through the layers of the forward pass of a deep neural network. For deep equilibrium models, however, this is not desired, since the gradient should be independent of the root-finding algorithm.

The loss function follows as

$$
\ell=\mathcal{L}\left(h\left(\operatorname{RootFind}\left(g_\theta ; u\right)\right), y\right),
$$
with the output layer $h:\mathbb{R}^{n_z} \mapsto \mathbb{R}^{n_y}$, which can be any differentiable function (e.g. linear), the ground-truth sequence $y$, and the loss function $\mathcal{L}:\mathbb{R}^{n_y}\times\mathbb{R}^{n_y} \mapsto \mathbb{R}$.

The gradient with respect to $(\cdot)$ (e.g. $\theta$) can now be calculated by implicit differentiation
$$
\frac{\partial \ell}{\partial(\cdot)}=-\frac{\partial \ell}{\partial h} \frac{\partial h}{\partial x^{\star}}\left(\left.J_{g_\theta}^{-1}\right|_{x^*}\right) \frac{\partial f_\theta\left(x^{\star} ; u\right)}{\partial(\cdot)},
$$
where $\left.J_{g_\theta}^{-1}\right|_{x^*}$ is the inverse Jacobian of $g_{\theta}$ evaluated at $x^*$.

For details on the gradient and how it can be calculated, see [Chapter 4](http://implicit-layers-tutorial.org/deep_equilibrium_models/) of the implicit layers tutorial.
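
As a rough sketch of how this implicit gradient can be evaluated, the toy example below differentiates the loss with respect to the bias $b$ of an assumed weight-tied layer $f_\theta(x; u) = \tanh(Wx + Uu + b)$ and checks the result against a finite difference; all names and the quadratic loss are assumptions, not the notation of the referenced tutorial.

```python
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(0)
n_x, n_u, n_y = 10, 1, 1
W = 0.9 * rng.standard_normal((n_x, n_x)) / n_x  # scaled so a unique equilibrium exists
U = rng.standard_normal((n_x, n_u))
b = rng.standard_normal(n_x)
W_y = rng.standard_normal((n_y, n_x))
u = rng.standard_normal(n_u)
y = rng.standard_normal(n_y)

f = lambda x, b_: np.tanh(W @ x + U @ u + b_)  # weight-tied layer f_theta(x; u)
g = lambda x, b_: f(x, b_) - x                 # equilibrium condition g_theta(x*) = 0

def loss(b_):
    x_star = fsolve(lambda x: g(x, b_), np.zeros(n_x))  # forward pass
    return 0.5 * np.sum((W_y @ x_star - y) ** 2), x_star

_, x_star = loss(b)
r = W_y @ x_star - y                            # dl/dh for the quadratic loss
s = 1.0 - np.tanh(W @ x_star + U @ u + b) ** 2  # tanh'(.) at the equilibrium
J_g = np.diag(s) @ W - np.eye(n_x)              # Jacobian of g_theta at x*
df_db = np.diag(s)                              # df_theta/db at x*

# dl/db = - dl/dh * dh/dx* * J_g^{-1}|_{x*} * df_theta/db
grad_b = -(r @ W_y) @ np.linalg.inv(J_g) @ df_db

# finite-difference check on the first component of b
eps, e0 = 1e-5, np.eye(n_x)[0]
fd = (loss(b + eps * e0)[0] - loss(b - eps * e0)[0]) / (2 * eps)
print(grad_b[0], fd)  # the two values should agree closely
```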

## Example
Let's make a simple example to compare a fixed-layer neural network with a deep equilibrium model. We assume sequence length $T=3$, hidden state size $n_x = 10$, and input and output size $n_y = n_u = 1$. The weights are randomly initialized, the initial hidden state is set to zero, $x^0 = 0$, with $W_x \in \mathbb{R}^{n_x \times n_x}$, $U_x\in \mathbb{R}^{n_x \times T}$, and we take a linear output layer with $W_y \in \mathbb{R}^{n_y \times n_x}$; the biases are sized accordingly.
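
A possible initialization for this setup is sketched below; treating $W_x$, $U_x$, $W_y$ as bias-free linear layers with separate bias tensors, taking the input map per sample of size $n_u$, and the extra scaling of $W_x$ are assumptions made only so that the snippets that follow have all names defined.

```python
import numpy as np
import torch

torch.manual_seed(0)
T, n_x, n_u, n_y, L = 3, 10, 1, 1, 30

W_x = torch.nn.Linear(n_x, n_x, bias=False)  # state-to-state map
U_x = torch.nn.Linear(n_u, n_x, bias=False)  # input-to-state map
W_y = torch.nn.Linear(n_x, n_y, bias=False)  # linear output layer
W_x.weight.data *= 0.5                       # scale down so the iteration contracts
b_x = torch.zeros(1, n_x)
b_y = torch.zeros(1, n_y)
nl = torch.tanh                              # activation / static nonlinearity

u = np.random.rand(n_u).astype(np.float32)   # random input sample
x_0 = np.zeros(n_x)                          # initial guess for the root finder

# numpy copies of the weights for the equilibrium model below
W_x_numpy = W_x.weight.detach().numpy()
U_x_numpy = U_x.weight.detach().numpy()
W_y_numpy = W_y.weight.detach().numpy()
b_x_numpy = b_x.numpy().reshape(n_x, 1)
b_y_numpy = b_y.numpy().reshape(n_y, 1)
```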

The forward pass of the $L$-layer sequence-to-sequence model in PyTorch:
```python
# forward pass for fixed number of layers
x = torch.zeros(size=(1, n_x))       # initial hidden state x^0 = 0
u = torch.tensor(u).reshape(1, n_u)  # input sample
for l in range(L):
    x = nl(W_x(x) + U_x(u) + b_x)    # weight-tied layer update
y_hat = W_y(x) + b_y                 # linear output layer
```
The forward pass for the deep equilibrium model:
```python
# DEQ: the forward pass becomes a root-finding problem for the equilibrium x*
def g_theta(x):
    # u is assumed to be the numpy input of shape (n_u,)
    x = x.reshape(n_x, 1)
    # residual of the equilibrium condition x = tanh(W_x x + U_x u + b_x)
    return np.squeeze(np.tanh(W_x_numpy @ x + U_x_numpy @ u.reshape(n_u, 1) + b_x_numpy) - x)

x_star, infodict, ier, mesg = fsolve(g_theta, x0=x_0, full_output=True)
x_star = x_star.reshape(n_x, 1)
y_hat_eq = W_y_numpy @ x_star + b_y_numpy
```
Note that these are only small snippets that should give an idea of how to implement the models; the code is not supposed to run without further adjustment. For the root-finding algorithm, [scipy.optimize.fsolve](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fsolve.html) is used.

<script type="text/tikz">
\begin{tikzpicture}[
node distance = 0.25cm and 0.5cm,
auto,
align=center,
block/.style={
draw,
rectangle,
rounded corners,
minimum height=2em,
minimum width=2em
}
]
% blocks
\node[] (input) {};
\node[block, right= of input] (G) {$G$};
\end{tikzpicture}
</script>

<!-- <script type="text/tikz">
\begin{tikzpicture}[
node distance = 0.25cm and 0.5cm,
auto,
align=center,
block/.style={
draw,
rectangle,
rounded corners,
minimum height=2em,
minimum width=2em
}
]
% blocks
\node[] (input) {};
\node[block, right= of input] (G) {
\begin{tikzpicture}[
node distance = 0.25cm and 0.5cm,
auto,
align=center,
block/.style={
draw,
rectangle,
rounded corners,
minimum height=2em,
minimum width=2em
}
]
\node[] (inL1) {};
\node[block, right= of inL1] (L1) {$f_{\theta}^{[0]}(z_{1:T}^0; x_{1:T})$};
\node[right= of L1] (outL1) {};
\node[above= of L1] (inX) {};
\node[right= of outL1] (dots) {$\cdots$};
\node[right= of dots] (inLL) {};
\node[block, right= of inLL] (LL) {$f_{\theta}^{[L-1]}(z_{1:T}^{L-1}; x_{1:T})$};
\node[right= of LL] (outLL) {};
\node[above= of LL] (inXL) {};
% Input and outputs coordinates
% lines
\draw[->] (inX) node[right] {$x_{1:T}$} -- (L1.north);
\draw[->] (inL1) node[above] {$z_{1:T}^0$} -- (L1);
\draw[->] (L1) -- (outL1) node[above] {$z^1_{1:T}$};
\draw[->] (inXL) node[right] {$x_{1:T}$} -- (LL.north);
\draw[->] (inLL) node[above] {$z_{1:T}^{L-1}$} -- (LL);
\draw[->] (LL) -- (outLL) node[above] {$z_{1:T}^L$};
\end{tikzpicture}
};
\node at (G.north) [above] {$\mathcal{S}_{\operatorname{DEQ}}$};
\node[right= of G] (output) {};
% Input and outputs coordinates
% lines
\draw[->] (input) node[above] {$x_{1:T}, z_{1:T}^0$} -- (G);
\draw[->] (G) -- (output) node[above] {$z_{1:T}^L$} ;
\end{tikzpicture}
</script> -->

TODO: add figure.

Note that such a network matches the system \eqref{eq:nl_system}.

The first step towards deep equilibrium networks is to tie the weights $f_{\theta}^{0}(x^0, u) = $f_{\theta}^{i}(x^0, u)$ for all $i=0, \ldots, L-1$. It turns out that this restriction does not hurt the prediction accuracy of the network, since any deep neural network can be replaced by a single layer by increasing the size of the weight (See TODO for details).

The weight tied network is shown in Figure TODO.

In a next step the number of layer is increased $L \to \infty$. The forward pass can now also be formulated as finding a fixed point $z^*$, which can be solved by a number of root fining algorithm as illustrated in Figure TODO
The results for different values of $L$ are compared:
```
Number of finite layers: 0 || x^L - x^* ||^2: 0.7032
Number of finite layers: 1 || x^L - x^* ||^2: 0.3898
Number of finite layers: 2 || x^L - x^* ||^2: 0.2898
Number of finite layers: 3 || x^L - x^* ||^2: 0.1621
Number of finite layers: 4 || x^L - x^* ||^2: 0.09451
Number of finite layers: 10 || x^L - x^* ||^2: 0.001685
Number of finite layers: 20 || x^L - x^* ||^2: 7.595e-06
Number of finite layers: 30 || x^L - x^* ||^2: 7.069e-08
```
The result shows that the weight-tied feed-forward network converges to the same output as the equilibrium network as the number of layers increases.
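
A minimal sketch of how this comparison could be produced, reusing the numpy weights and the equilibrium $x^*$ from the snippets above (the loop itself is an assumption, not the original benchmarking code):

```python
# compare the finite-depth, weight-tied forward pass against the equilibrium x*
for L in [0, 1, 2, 3, 4, 10, 20, 30]:
    x = np.zeros((n_x, 1))
    for _ in range(L):
        x = np.tanh(W_x_numpy @ x + U_x_numpy @ u.reshape(n_u, 1) + b_x_numpy)
    err = float(np.sum((x - x_star) ** 2))
    print(f"Number of finite layers: {L} || x^L - x^* ||^2: {err:.4}")
```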

# Monotone operator equilibrium networks


# System identification with equilibrium networks


Binary file added images/ren/fwd_deep.png
Binary file added images/ren/fwd_deq.png
Binary file added images/ren/fwd_tied.png
