class: middle, center, title-slide
Lecture 7: Generative adversarial networks
Prof. Gilles Louppe
[email protected]
???
- biggan https://openreview.net/pdf?id=B1xsqj09Fm
- stylegan
Goals: Learn models of the data itself.
- Generative models (lecture 5)
- Variational inference (lecture 5)
- Variational auto-encoders (lecture 5)
- Generative adversarial networks
class: middle
class: middle
The main idea of generative adversarial networks (GANs) is to express the task of learning a generative model as a two-player zero-sum game between two networks.
- The first network is a generator $g(\cdot;\theta) : \mathcal{Z} \to \mathcal{X}$, mapping a latent space equipped with a prior distribution $p(\mathbf{z})$ to the data space, thereby inducing a distribution
$$\mathbf{x} \sim p(\mathbf{x};\theta) \Leftrightarrow \mathbf{z} \sim p(\mathbf{z}), \mathbf{x} = g(\mathbf{z};\theta).$$
- The second network $d(\cdot; \phi) : \mathcal{X} \to [0,1]$ is a classifier trained to distinguish between true samples $\mathbf{x} \sim p_r(\mathbf{x})$ and generated samples $\mathbf{x} \sim p(\mathbf{x};\theta)$.
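As a toy illustration, the two players can be sketched as follows in PyTorch; the 8-dimensional latent prior, 2-dimensional data space and MLP architectures are arbitrary illustrative choices, not part of the original formulation.

```python
import torch
import torch.nn as nn

# Toy sketch of the two players: a generator g(.; theta): Z -> X and a
# classifier d(.; phi): X -> [0, 1]. Dimensions and architectures are
# arbitrary choices for illustration.
latent_dim, data_dim = 8, 2

g = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                  nn.Linear(128, data_dim))
d = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(),
                  nn.Linear(128, 1), nn.Sigmoid())

z = torch.randn(64, latent_dim)   # z ~ p(z), here a standard Gaussian prior
x_fake = g(z)                     # induced samples x ~ p(x; theta)
print(d(x_fake).shape)            # torch.Size([64, 1]): probabilities of being real
```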
The central mechanism will be to use supervised learning to guide the learning of the generative model.
class: middle
Consider a generator $g$ fixed at $\theta$. The best classifier $d$ against this generator is the one that best separates the true samples $\mathbf{x} \sim p_r(\mathbf{x})$ from the generated samples $\mathbf{x} \sim p(\mathbf{x};\theta)$.
Following Goodfellow et al (2014), let us define the value function
$$V(\phi, \theta) = \mathbb{E}_{\mathbf{x} \sim p_r(\mathbf{x})}\left[ \log d(\mathbf{x};\phi) \right] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[ \log (1-d(g(\mathbf{z};\theta);\phi)) \right].$$
Then,
- $V(\phi, \theta)$ is high if $d$ is good at recognizing true from generated samples.
- If $d$ is the best classifier given $g$, and if $V$ is high, then this implies that the generator is bad at reproducing the data distribution.
- Conversely, $g$ will be a good generative model if $V$ is low when $d$ is a perfect opponent.
Therefore, the ultimate goal is
$$\theta^* = \arg \min_\theta \max_\phi V(\phi, \theta).$$
For a generator $g$ fixed at $\theta$, the value function is maximized by the classifier
$$d(\mathbf{x};\phi^*_\theta) = \frac{p_r(\mathbf{x})}{p_r(\mathbf{x}) + p(\mathbf{x};\theta)}.$$
Therefore,
$$\begin{aligned}
&\min_\theta \max_\phi V(\phi, \theta) = \min_\theta V(\phi^*_\theta, \theta) \\
&= \min_\theta \mathbb{E}_{\mathbf{x} \sim p_r(\mathbf{x})}\left[ \log \frac{p_r(\mathbf{x})}{p(\mathbf{x};\theta) + p_r(\mathbf{x})} \right] + \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x};\theta)}\left[ \log \frac{p(\mathbf{x};\theta)}{p(\mathbf{x};\theta) + p_r(\mathbf{x})} \right] \\
&= \min_\theta \text{KL}\left(p_r(\mathbf{x}) || \frac{p_r(\mathbf{x}) + p(\mathbf{x};\theta)}{2}\right) \\
&\quad\quad\quad+ \text{KL}\left(p(\mathbf{x};\theta) || \frac{p_r(\mathbf{x}) + p(\mathbf{x};\theta)}{2}\right) -\log 4\\
&= \min_\theta 2\, \text{JSD}(p_r(\mathbf{x}) || p(\mathbf{x};\theta)) - \log 4
\end{aligned}$$
where $\text{JSD}$ is the Jensen-Shannon divergence.
In summary, solving the minimax problem
$$\theta^* = \arg \min_\theta \max_\phi V(\phi, \theta)$$
is equivalent to solving
$$\theta^* = \arg \min_\theta \text{JSD}(p_r(\mathbf{x}) || p(\mathbf{x};\theta)).$$
Since $\text{JSD}(p_r(\mathbf{x}) || p(\mathbf{x};\theta))$ is minimized, with value $0$, if and only if $p_r(\mathbf{x}) = p(\mathbf{x};\theta)$ for all $\mathbf{x}$, the minimax solution corresponds to a generative model that perfectly reproduces the true data distribution.
.center[(Goodfellow et al, 2014)]
In practice, the minimax solution is approximated using alternating stochastic gradient descent, for which gradients $$\begin{aligned} \nabla_\phi V(\phi, \theta) &= \mathbb{E}_{\mathbf{x} \sim p_r(\mathbf{x})}\left[ \nabla_\phi \log d(\mathbf{x};\phi) \right] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[ \nabla_\phi \log (1-d(g(\mathbf{z};\theta);\phi)) \right], \\ \nabla_\theta V(\phi, \theta) &= \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[ \nabla_\theta \log (1-d(g(\mathbf{z};\theta);\phi)) \right], \end{aligned}$$ are approximated using Monte Carlo integration.
These noisy estimates can in turn be used alternately
to do gradient ascent on $\phi$ and gradient descent on $\theta$, as sketched in the code below.
- For one step on $\theta$, we can optionally take $k$ steps on $\phi$, since we need the classifier to remain near optimal.
- Note that to compute $\nabla_\theta V(\phi, \theta)$, it is necessary to backprop all the way through $d$ before computing the partial derivatives with respect to $g$'s internals.
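A minimal sketch of this alternating procedure, reusing the toy `g`, `d` and `latent_dim` defined earlier; `sample_real` is a hypothetical helper returning a minibatch of true samples.

```python
import torch

# Alternating SGD sketch for the minimax game. sample_real() is a hypothetical
# function returning a minibatch of true samples x ~ p_r(x).
opt_d = torch.optim.SGD(d.parameters(), lr=1e-3)
opt_g = torch.optim.SGD(g.parameters(), lr=1e-3)
k = 1                                    # optional number of phi steps per theta step

for step in range(10_000):
    # Gradient ascent on phi: maximize V(phi, theta) with g frozen (detach).
    for _ in range(k):
        x_real = sample_real(64)
        x_fake = g(torch.randn(64, latent_dim)).detach()
        v = torch.log(d(x_real)).mean() + torch.log(1 - d(x_fake)).mean()
        opt_d.zero_grad()
        (-v).backward()                  # ascent on V is descent on -V
        opt_d.step()

    # Gradient descent on theta: minimize E_z[log(1 - d(g(z)))], backpropagating
    # all the way through d before reaching g's parameters.
    x_fake = g(torch.randn(64, latent_dim))
    loss_g = torch.log(1 - d(x_fake)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```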
class: middle
.center[(Goodfellow et al, 2014)]
class: middle
.center[(Goodfellow et al, 2014)]
Training a standard GAN often results in pathological behaviors:
- Oscillations without convergence: contrary to standard loss minimization, alternating stochastic gradient descent has no guarantee of convergence.
- Vanishing gradient: when the classifier $d$ is too good, the value function saturates and we end up with no gradient to update the generator (more on this later).
- Mode collapse: the generator $g$ models very well a small sub-population, concentrating on a few modes of the data distribution.
Performance is also difficult to assess in practice.
.center[Mode collapse (Metz et al, 2016)]
class: middle, center
class: middle
Deep generative architectures require layers that increase the input dimension,
i.e., that go from $\mathbb{R}^q$ to $\mathbb{R}^p$, with $q < p$.
- This is the opposite of what we did so far with feedforward networks, in which we reduced the dimension of the input to a few values.
- Fully connected layers could be used for that purpose but would face the same limitations as before (spatial specialization, too many parameters).
- Ideally, we would like layers that implement the inverse of convolutional and pooling layers.
For an input $\mathbf{x} \in \mathbb{R}^{H \times W}$ and a convolutional kernel $\mathbf{u} \in \mathbb{R}^{h \times w}$, the convolution $\mathbf{x} \star \mathbf{u}$ computes, at each position, the dot product between $\mathbf{u}$ and the corresponding $h \times w$ patch of $\mathbf{x}$, producing an output of size $(H-h+1) \times (W-w+1)$.
For example, $$\begin{pmatrix} 4 & 5 & 8 & 7 \\ 1 & 8 & 8 & 8 \\ 3 & 6 & 6 & 4 \\ 6 & 5 & 7 & 8 \end{pmatrix} \star \begin{pmatrix} 1 & 4 & 1 \\ 1 & 4 & 3 \\ 3 & 3 & 1 \end{pmatrix} = \begin{pmatrix} 122 & 148 \\ 126 & 134 \end{pmatrix}$$
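This example can be checked numerically; note that, as in most deep learning frameworks, the $\star$ operation above is a cross-correlation (no kernel flip), hence `correlate2d`.

```python
import numpy as np
from scipy.signal import correlate2d

# Numerical check of the 4x4 (star) 3x3 example above.
x = np.array([[4, 5, 8, 7],
              [1, 8, 8, 8],
              [3, 6, 6, 4],
              [6, 5, 7, 8]])
u = np.array([[1, 4, 1],
              [1, 4, 3],
              [3, 3, 1]])
print(correlate2d(x, u, mode="valid"))
# [[122 148]
#  [126 134]]
```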
class: middle
The convolution operation can be equivalently re-expressed as a single matrix multiplication.
Following the previous example,
- the convolutional kernel $\mathbf{u}$ is rearranged as a sparse Toeplitz circulant matrix, called the convolution matrix:
$$\mathbf{U} = \begin{pmatrix} 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 \end{pmatrix}$$
- the input $\mathbf{x}$ is flattened row by row, from top to bottom:
$$v(\mathbf{x}) = \begin{pmatrix} 4 & 5 & 8 & 7 & 1 & 8 & 8 & 8 & 3 & 6 & 6 & 4 & 6 & 5 & 7 & 8 \end{pmatrix}^T$$
Then,
$$\mathbf{U}v(\mathbf{x}) =
\begin{pmatrix}
122 & 148 & 126 & 134
\end{pmatrix}^T$$
which we can reshape to a $2 \times 2$ matrix to retrieve the result of the convolution $\mathbf{x} \star \mathbf{u}$.
???
Make diagram to obtain
The same procedure generalizes to an input $\mathbf{x} \in \mathbb{R}^{H \times W}$ and a convolutional kernel $\mathbf{u} \in \mathbb{R}^{h \times w}$:
- the convolutional kernel is rearranged as a sparse Toeplitz circulant matrix $\mathbf{U}$ of shape $(H-h+1)(W-w+1) \times HW$ where
  - each row $i$ identifies an element of the output feature map,
  - each column $j$ identifies an element of the input feature map,
  - the value $\mathbf{U}_{i,j}$ corresponds to the kernel value the element $j$ is multiplied with in output $i$;
- the input $\mathbf{x}$ is flattened into a column vector $v(\mathbf{x})$ of shape $HW \times 1$;
- the output feature map $\mathbf{x} \star \mathbf{u}$ is obtained by reshaping the $(H-h+1)(W-w+1) \times 1$ column vector $\mathbf{U}v(\mathbf{x})$ as a $(H-h+1) \times (W-w+1)$ matrix (see the sketch below).
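A small sketch of this construction; the function name `conv_matrix` and the loop-based implementation are mine, for illustration only.

```python
import numpy as np

# Build the convolution matrix U of shape (H-h+1)(W-w+1) x HW described above,
# following the row-by-row flattening convention of the slides.
def conv_matrix(u, H, W):
    h, w = u.shape
    Ho, Wo = H - h + 1, W - w + 1
    U = np.zeros((Ho * Wo, H * W))
    for i in range(Ho):                    # output row
        for j in range(Wo):                # output column
            for a in range(h):
                for b in range(w):
                    U[i * Wo + j, (i + a) * W + (j + b)] = u[a, b]
    return U

x = np.array([[4, 5, 8, 7], [1, 8, 8, 8], [3, 6, 6, 4], [6, 5, 7, 8]])
u = np.array([[1, 4, 1], [1, 4, 3], [3, 3, 1]])
U = conv_matrix(u, 4, 4)
print((U @ x.flatten()).reshape(2, 2))     # [[122. 148.] [126. 134.]], as before
```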
Therefore, a convolutional layer is a special case of a fully
connected layer: it can equivalently be written as $v(\mathbf{h}) = \mathbf{U}v(\mathbf{x})$, i.e. as a fully connected layer whose weight matrix is constrained to the sparse, weight-sharing structure of $\mathbf{U}$.
class: middle
In a fully connected layer $v(\mathbf{h}) = \mathbf{W}^T v(\mathbf{x})$, the forward pass multiplies the input by $\mathbf{W}^T$, while the backward pass multiplies the upstream gradient by $\mathbf{W}$.
Since a convolutional layer computes $v(\mathbf{h}) = \mathbf{U} v(\mathbf{x})$, its backward pass amounts to multiplying the upstream gradient by $\mathbf{U}^T$:
- The backward pass takes some $q$-dimensional vector as input and produces some $p$-dimensional vector as output, with $q < p$.
- It does so while keeping a connectivity pattern that is compatible with $\mathbf{U}$, by construction.
A transposed convolution is a convolution where the implementation of the forward and backward passes are swapped.
Therefore, a transposed convolution can be seen as the gradient of some convolution with respect to its input.
Given a convolutional kernel $\mathbf{u}$,
- the forward pass is implemented as $v(\mathbf{h}) = \mathbf{U}^T v(\mathbf{x})$ with appropriate reshaping, thereby effectively up-sampling an input $v(\mathbf{x})$ into a larger one;
- the backward pass is computed by multiplying the loss by $\mathbf{U}$ instead of $\mathbf{U}^T$.
Transposed convolutions are also referred to as fractionally-strided convolutions or deconvolutions (mistakenly).
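The relationship can be checked directly in PyTorch: `conv_transpose2d` up-samples and is the adjoint of `conv2d`, i.e. it multiplies by $\mathbf{U}^T$ where `conv2d` multiplies by $\mathbf{U}$ (the shapes below are arbitrary toy choices).

```python
import torch
import torch.nn.functional as F

# conv_transpose2d up-samples (2x2 -> 4x4) and is the adjoint of conv2d.
torch.manual_seed(0)
x = torch.randn(1, 1, 4, 4)                  # input of size H x W
u = torch.randn(1, 1, 3, 3)                  # kernel of size h x w
h = torch.randn(1, 1, 2, 2)                  # an output-sized tensor

y = F.conv2d(x, u)                           # U v(x): 4x4 -> 2x2
z = F.conv_transpose2d(h, u)                 # U^T v(h): 2x2 -> 4x4
print(y.shape, z.shape)

# Adjoint identity <U v(x), v(h)> = <v(x), U^T v(h)>:
print(torch.allclose((y * h).sum(), (x * z).sum(), atol=1e-5))   # True
```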
class: middle
class: middle
class: middle, center
.center[Transposed convolution (no padding, no stride)]
Given transposed convolutional layers, we are now equipped for building deep convolutional generative models.
Radford et al (2015) identify the following guidelines to ensure stable training:
- replace pooling layers with strided convolutions (in $d$) and strided transposed convolutions (in $g$);
- use batch normalization in both $g$ and $d$;
- remove fully connected hidden layers for deeper architectures;
- use ReLU activations in $g$ for all layers except for the output, which uses tanh;
- use LeakyReLU activations in $d$ for all layers.
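Put together, a DCGAN-style generator can be sketched as follows in PyTorch; this is a minimal variant following the guidelines above, with illustrative channel widths rather than the exact architecture of the paper.

```python
import torch
import torch.nn as nn

# DCGAN-style generator sketch: transposed convolutions upsample a latent
# vector z (nz-dimensional) into a 64x64 RGB image.
nz, ngf = 100, 64

generator = nn.Sequential(
    # z: (nz, 1, 1) -> (ngf*8, 4, 4)
    nn.ConvTranspose2d(nz, ngf * 8, kernel_size=4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(ngf * 8), nn.ReLU(inplace=True),
    # -> (ngf*4, 8, 8)
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf * 4), nn.ReLU(inplace=True),
    # -> (ngf*2, 16, 16)
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf * 2), nn.ReLU(inplace=True),
    # -> (ngf, 32, 32)
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf), nn.ReLU(inplace=True),
    # -> (3, 64, 64), tanh output in [-1, 1]
    nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
    nn.Tanh(),
)

z = torch.randn(16, nz, 1, 1)
print(generator(z).shape)    # torch.Size([16, 3, 64, 64])
```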
class: middle, center
.center[The DCGAN generator architecture (Radford et al, 2015)]
class: middle, center
.center[(Radford et al, 2015)]
class: middle, center
.center[(Radford et al, 2015)]
class: middle, center
.center[Vector arithmetic in the latent space (Radford et al, 2015)]
.center[(Karras et al, 2017)]
class: middle
.center[
<iframe width="640" height="480" src="https://www.youtube.com/embed/XOxxPcy5Gr4?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>.center[(Karras et al, 2017)]
]
While state-of-the-art results are impressive, a close inspection of the fake samples distribution $p(\mathbf{x};\theta)$ reveals several recurring issues, as illustrated in the following examples.
These issues remain an open research problem.
.center[Cherry-picks (Goodfellow, 2016)]
class: middle
.center[Problems with counting (Goodfellow, 2016)]
class: middle
.center[Problems with perspective (Goodfellow, 2016)]
class: middle
.center[Problems with global structures (Goodfellow, 2016)]
class: center, middle
For most non-toy data distributions, the fake samples $\mathbf{x} \sim p(\mathbf{x};\theta)$ are initially so easy to tell apart from the real ones that the classifier $d$ can quickly become near perfect.
Dilemma:
- If $d$ is bad, then $g$ does not have accurate feedback and the loss function cannot represent the reality.
- If $d$ is too good, the gradients drop to 0, thereby slowing down or even halting the optimization (illustrated numerically below).
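To make the second point concrete: writing $d = \sigma(s)$ for a logit $s$, the generator loss $\log(1 - d(g(\mathbf{z})))$ has derivative $-\sigma(s)$ with respect to $s$, which vanishes as the classifier becomes confident that the sample is fake. A toy check:

```python
import torch

# Gradient of log(1 - sigmoid(s)) with respect to the logit s is -sigmoid(s):
# it vanishes when the classifier confidently rejects the fake sample (s << 0).
for s0 in [0.0, -5.0, -20.0]:
    s = torch.tensor(s0, dtype=torch.float64, requires_grad=True)
    torch.log(1 - torch.sigmoid(s)).backward()
    print(f"s = {s0:>6}: grad = {s.grad.item():.3e}")
# s =    0.0: grad = -5.000e-01
# s =   -5.0: grad = -6.693e-03
# s =  -20.0: grad = -2.061e-09
```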
For any two distributions $p$ and $q$,
- $\text{JSD}(p||q) = 0$ if and only if $p = q$,
- $\text{JSD}(p||q) = \log 2$ if and only if $p$ and $q$ have disjoint supports.
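A quick numerical illustration of this saturation, using two discretized Gaussians pulled progressively apart (SciPy returns the Jensen-Shannon *distance*, the square root of the divergence):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# JSD between two discretized unit-variance Gaussians: it saturates at log 2
# as soon as the two distributions barely overlap.
x = np.linspace(-20, 20, 4001)
p = np.exp(-0.5 * x**2)
p /= p.sum()
for mu in [0.0, 1.0, 5.0, 10.0]:
    q = np.exp(-0.5 * (x - mu)**2)
    q /= q.sum()
    jsd = jensenshannon(p, q) ** 2        # square the JS distance to get the divergence
    print(mu, jsd, np.log(2))
```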
Notice how the Jensen-Shannon divergence poorly accounts for the metric structure of the space.
Intuitively, instead of comparing distributions "vertically", we would like to compare them "horizontally".
An alternative choice is the Earth mover's distance, which intuitively corresponds to the minimum mass displacement to transform one distribution into the other.
Consider $p = \frac{1}{4}\mathbf{1}_{[1,2]} + \frac{1}{4}\mathbf{1}_{[3,4]} + \frac{1}{2}\mathbf{1}_{[9,10]}$ and $q = \frac{1}{2}\mathbf{1}_{[5,7]}$ (both normalized to unit mass).
Then, moving the mass of $p$ optimally onto $q$ (matching sorted quantiles) gives a total cost of $\frac{1}{4}\cdot\frac{15}{4} + \frac{1}{4}\cdot\frac{9}{4} + \frac{1}{2}\cdot 3 = 3$ (checked numerically below).
.footnote[Credits: EE559 Deep Learning (Fleuret, 2018)]
The Earth mover's distance is also known as the Wasserstein-1 distance and is defined as
$$W_1(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma}\left[ ||x-y|| \right]$$
where:
- $\Pi(p,q)$ denotes the set of all joint distributions $\gamma(x,y)$ whose marginals are respectively $p$ and $q$;
- $\gamma(x,y)$ indicates how much mass must be transported from $x$ to $y$ in order to transform the distribution $p$ into $q$;
- $||\cdot||$ is the L1 norm and $||x-y||$ represents the cost of moving a unit of mass from $x$ to $y$.
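For instance, the distance in the example above can be estimated with `scipy.stats.wasserstein_distance`, which operates on weighted samples; here both densities are finely discretized and assumed normalized to unit mass.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Discretize the piecewise-uniform densities p and q of the previous example.
xp = np.concatenate([np.linspace(1, 2, 1000), np.linspace(3, 4, 1000), np.linspace(9, 10, 1000)])
wp = np.concatenate([np.full(1000, 0.25), np.full(1000, 0.25), np.full(1000, 0.50)])
xq = np.linspace(5, 7, 2000)
wq = np.full(2000, 1.0)

print(wasserstein_distance(xp, xq, wp, wq))   # ~3.0, as computed above
```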
class: middle
Notice how the Earth mover's distance takes the metric structure of the space into account: it grows with the amount of displacement between $p$ and $q$ instead of saturating.
For any two distributions $p$ and $q$,
- $W_1(p,q) \in \mathbb{R}^+$,
- $W_1(p,q) = 0$ if and only if $p = q$.
Given the attractive properties of the Wasserstein-1 distance, Arjovsky et al (2017) propose
to learn a generative model by solving instead
$$\theta^* = \arg \min_\theta W_1(p_r(\mathbf{x}), p(\mathbf{x};\theta)).$$
Unfortunately, the infimum over all couplings in $\Pi(p_r, p(\cdot;\theta))$ is intractable in general. On the other hand, the Kantorovich-Rubinstein duality tells us that
$$W_1(p_r(\mathbf{x}), p(\mathbf{x};\theta)) = \sup_{||f||_L \leq 1} \mathbb{E}_{\mathbf{x} \sim p_r(\mathbf{x})}\left[ f(\mathbf{x}) \right] - \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x};\theta)}\left[ f(\mathbf{x}) \right],$$
where the supremum is taken over all functions $f : \mathcal{X} \to \mathbb{R}$ that are 1-Lipschitz, i.e. such that $|f(x) - f(x')| \leq ||x - x'||$ for all $x, x'$.
.footnote[Credits: EE559 Deep Learning (Fleuret, 2018)]
Using this result, the Wasserstein GAN algorithm consists in solving the minimax problem
$$\theta^* = \arg \min_\theta \max_\phi \mathbb{E}_{\mathbf{x} \sim p_r(\mathbf{x})}\left[ d(\mathbf{x};\phi) \right] - \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[ d(g(\mathbf{z};\theta);\phi) \right].$$
Compared to the standard GAN formulation:
- The classifier $d : \mathcal{X} \to [0,1]$ is replaced by a critic function $d : \mathcal{X} \to \mathbb{R}$ and its output is not interpreted through the cross-entropy loss;
- There is a strong regularization on the form of $d$. In practice, to ensure 1-Lipschitzness,
  - Arjovsky et al (2017) propose to clip the weights of the critic at each iteration (as in the sketch after this list);
  - Gulrajani et al (2017) add a regularization term to the loss.
- As a result, Wasserstein GANs benefit from:
  - a meaningful loss metric,
  - improved stability (no mode collapse is observed).
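A minimal training sketch with weight clipping; it reuses the toy `g`, `latent_dim` and hypothetical `sample_real` from the earlier sketches, with the critic now replacing the sigmoid classifier.

```python
import torch
import torch.nn as nn

# WGAN sketch with weight clipping (Arjovsky et al, 2017); the critic outputs
# an unbounded score instead of a probability.
d = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))   # critic: X -> R

opt_d = torch.optim.RMSprop(d.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(g.parameters(), lr=5e-5)
c, n_critic = 0.01, 5        # clipping constant, critic steps per generator step

for step in range(10_000):
    for _ in range(n_critic):                             # keep the critic near optimal
        x_real = sample_real(64)
        x_fake = g(torch.randn(64, latent_dim)).detach()
        loss_d = -(d(x_real).mean() - d(x_fake).mean())   # ascent on the dual objective
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()
        for p in d.parameters():                          # crude proxy for 1-Lipschitzness
            p.data.clamp_(-c, c)

    x_fake = g(torch.randn(64, latent_dim))
    loss_g = -d(x_fake).mean()                            # minimize the W1 estimate in theta
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```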
class: middle
.center[(Arjovsky et al, 2017)]
class: middle
.center[(Arjovsky et al, 2017)]
class: middle
.center[(Arjovsky et al, 2017)]
class: middle
.center[(Arjovsky et al, 2017)]
class: center, middle
class: middle, center
.center[
.center[(Zhu et al, 2017)]
]
class: middle
.center[
<iframe width="640" height="480" src="https://www.youtube.com/embed/3AIpPlzM_qs?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>.center[(Wang et al, 2017)]
]
.center[(Shetty et al, 2017)]
.center[
.center[(Zhang et al, 2017)]
]
class: middle
.center[
.center[(Zhang et al, 2017)]
]
.center[
.center[(Lample et al, 2018)]
]
class: middle
.center[
.center[(Lample et al, 2018)]
]
.center[
.center[(Shen et al, 2018)]
]
class: middle
.center[
.center[(Shen et al, 2018)]
]
class: middle
.center[
<iframe width="640" height="480" src="https://www.youtube.com/embed/jsp1KaM-avU?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>.center[(Shen et al, 2018)]
]
class: end-slide, center count: false
The end.
- EE-559 Deep learning (Fleuret, 2018)
- Tutorial: Generative adversarial networks (Goodfellow, 2016)
- From GAN to WGAN (Weng, 2017)
- Wasserstein GAN and the Kantorovich-Rubinstein Duality (Herrmann, 2017)