StyleGAN2 in a Nutshell

  • Best-in-class GAN for artificially synthesizing natural-looking images by means of Machine Learning

  • There are approx. 50 publicly available pre-trained StyleGAN2 domain models

  • Each StyleGAN2 domain model can synthesize endless variations of images within the respective domain

  • Invented by NVIDIA in 2020, see teaser video & source code

  • New version "Alias-Free GAN" expected in September 2021

  • Examples from afhqwild.pkl and ffhq.pkl model data (generated with my own implementation):

    wild ffhq-512

Legal information: All content of "StyleGAN2 in a Nutshell" is © 2021 HANS ROETTGER. You may use parts of it in your own publications if you mark them clearly with "source: [email protected]"

What is a GAN - Generative Adversarial Network?

  • First invented at Université de Montréal in 2014 (see evolution of GANs)

  • Generative Network: A Generator synthesizes fake objects similar to a Training Set (domain)

  • Adversarial: the Generator and a Discriminator component compete with each other and hence both become better and better with each optimization step of the learning process

  • Even though the Generator & the Discriminator synthesize & assess totally at random in the beginning, the Generator will eventually have learned to synthesize fake objects that can hardly be distinguished from the Training Set (see the training-step sketch below).

    GAN
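To make the adversarial loop concrete, here is a minimal training-step sketch in TensorFlow. It is a generic toy GAN, not the repository's code; model sizes, loss choice and all names are illustrative:

```python
import tensorflow as tf

# Tiny illustrative models: any Generator/Discriminator pair with matching
# shapes would do for this sketch.
G = tf.keras.Sequential([
    tf.keras.layers.Dense(28 * 28, activation="tanh", input_shape=(512,)),
    tf.keras.layers.Reshape((28, 28, 1)),
])
D = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(1),                     # scalar assessment (logit)
])
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(real_images, z_dim=512):
    z = tf.random.normal([tf.shape(real_images)[0], z_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fakes = G(z, training=True)
        real_score = D(real_images, training=True)
        fake_score = D(fakes, training=True)
        # Discriminator: push real scores up and fake scores down
        d_loss = (bce(tf.ones_like(real_score), real_score) +
                  bce(tf.zeros_like(fake_score), fake_score))
        # Generator: try to make the Discriminator rate fakes as real
        g_loss = bce(tf.ones_like(fake_score), fake_score)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss
```

Repeating this step over the whole Training Set, both networks improve together: the Discriminator sharpens its criteria, and the Generator learns to pass them.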

The StyleGAN2 Learning Process

  • Needs approx. 10,000 images in the Training Set

  • Takes a few days of computing time

  • Video below: learning process based on 11,300 fashion images fed 200 times (= "epochs") to the Discriminator, resulting in more than 2 million optimization steps for the Discriminator and the Generator (see the step count below). The video shows the generated fake images after each learning epoch.

    learning
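The step count above follows from a simple calculation, assuming one optimization step per image shown to the Discriminator (an assumption for this count, not a statement from the repository):

```python
images = 11_300            # fashion images in the Training Set
epochs = 200               # each image is fed 200 times to the Discriminator
steps = images * epochs    # one optimization step per image shown (assumed)
print(steps)               # 2260000 -> "more than 2 million optimization steps"
```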

The StyleGAN2 Discriminator: Image ➡️ Assessment

  • Reviews FAKE & real images and adjusts its assessment criteria in each learning step so that FAKE images get lower results than real images (on average)

  • Implemented like many other DNN recognizers: consecutive filters (3x3 convolutions) that generate more and more abstract image features, followed by a dense layer that maps the features to a scalar value (a minimal sketch follows below)

  • The discriminator's assessment results are neither absolute nor constant for a given image, but change with each learning step. The video below shows that a fixed sample of real images gets different assessment results over time as the discriminator tries to keep up with a generator that is getting better and better:

    Assessment
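A recognizer-style discriminator as described above could look roughly like this minimal Keras sketch. Layer widths and block count are illustrative; the real StyleGAN2 discriminator adds residual connections, a minibatch standard deviation layer and other details omitted here:

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_discriminator(resolution=64, channels=3):
    """Toy recognizer-style discriminator: stacked 3x3 convolutions that
    produce increasingly abstract features, then a dense layer mapping
    the features to a single scalar assessment."""
    x = inputs = layers.Input((resolution, resolution, channels))
    filters = 32
    while resolution > 4:                       # halve the resolution per block
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.AveragePooling2D()(x)
        resolution //= 2
        filters = min(filters * 2, 512)
    x = layers.Flatten()(x)
    score = layers.Dense(1)(x)                  # higher score = "looks real"
    return tf.keras.Model(inputs, score)
```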

StyleGAN2 Special Properties: z ➡️ Image

  • The StyleGAN2 Generator synthesizes outputs based on a small input vector z, the "modulation" (typical size: 2048 bytes)
  • Different values of z generate different outputs
  • Therefore z can be interpreted as a very compressed representation of the synthesized output
  • For almost all natural images there exists a generating z
  • Most important: similar z generate similar output objects! Accordingly, a linear combination of two input vectors z1 & z2 results in an output object in between the outputs generated by z1 & z2 (see the interpolation sketch below)

interpolation interpolation
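A sketch of such a latent interpolation, assuming G is any callable that maps a batch of latent vectors to a batch of images (the names here are illustrative, not the repository's API):

```python
import numpy as np

def interpolate(G, z1, z2, steps=8):
    """Linear blends between two latent vectors z1 and z2. Because similar z
    produce similar outputs, the generated images morph smoothly from the
    image for z1 to the image for z2."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]            # (steps, 1)
    zs = (1.0 - alphas) * z1[None, :] + alphas * z2[None, :]  # (steps, z_dim)
    return G(zs)                                              # one image per blend
```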

The StyleGAN2 Generator in Detail

The StyleGAN2 Generator has no clue WHAT it is synthesizing (no hidden 3D geometric model, no part/whole relationships, no lighting models, nothing at all). It just adds and removes dabs of paint, starting at a very coarse resolution and adding finer dabs in consecutive layers - similar to the wet-on-wet painting technique of Bob Ross.

z (mapped to w) ➡️ Image

A 64x64xRGBA image generator with 5 layers that has learned to generate emojis. All output layers have been normalized to visualize the full information they contain (the ranges of values differ in reality). Another example with 7 layers: 256x256xRGB

  • The output image is synthesized as the sum of the consecutive image layers L

  • Each image layer L is a projection P of a higher dimensional data space into the desired number of image channels C

  • The higher dimensional data space for each L is generated by convolution filters C1, C2 from its predecessor

  • The input vector w "modulates" each convolution filter and the projection! (w is a mapped version of z to achieve an even distribution in image space)

  • Additionally, noise is added in each layer to generate more image variations (a simplified layer sketch follows this list). Final result of applying noise to different layers:

    L0: N0 L1: N1 L2: N2 L3: N3 L4: N4
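A heavily simplified sketch of one such layer, reusing the names C1, C2, P and L from the bullets above. Modulation is approximated here by scaling the convolution inputs with styles derived from w; the real StyleGAN2 additionally uses weight demodulation, biases and learned upsampling, all omitted for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers

class GeneratorLayer(tf.keras.layers.Layer):
    """One synthesis layer: 2x upsample, two 3x3 convolutions C1/C2 whose
    inputs are "modulated" by styles derived from w, per-layer noise, and a
    1x1 projection P down to the desired number of image channels."""
    def __init__(self, in_features, features, channels=3):
        super().__init__()
        self.style1 = layers.Dense(in_features)  # w -> scale for C1's input channels
        self.style2 = layers.Dense(features)     # w -> scale for C2's input channels
        self.C1 = layers.Conv2D(features, 3, padding="same", activation="relu")
        self.C2 = layers.Conv2D(features, 3, padding="same", activation="relu")
        self.P  = layers.Conv2D(channels, 1)     # projection to the image channels

    def call(self, x, w):
        x = tf.repeat(tf.repeat(x, 2, axis=1), 2, axis=2)   # nearest-neighbour 2x upsample
        x = self.C1(x * self.style1(w)[:, None, None, :])   # modulate, then C1
        x = self.C2(x * self.style2(w)[:, None, None, :])   # modulate, then C2
        x = x + tf.random.normal(tf.shape(x))               # per-layer noise
        return x, self.P(x)                                 # features for the next layer, image layer L

# The final image is the sum of the per-layer contributions L, each brought to
# the output resolution before the addition.
```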

Own Experiments & Experience

  • Reimplemented StyleGAN2 from scratch to understand how it works and to eliminate some flaws of NVIDIA's reference implementation
  • NVIDIA implementation flaws: outdated Tensorflow version, square RGB images only, proprietary & non-transparent dnnlib, tendency to mode collapse, bad CPU inference
  • Own implementation: slightly slower, needs much more GPU memory, but the other flaws are eliminated!
  • Starting point: collected and applied the pre-trained domain models available on public websites
  • Training of own StyleGAN2 domain models needs a large amount of computational resources and well-prepared Training Sets (at least 5000 images; StyleGAN ADA claims to work with 2000 images)
  • Training Sets need to have a lot of variation within them but should not be too diverse. Spatial alignment is also critical, since neural networks do not cope well with translations. The upcoming "Alias-Free GAN" claims to deal better with translations and rotations

What Training Sets Worked Well?

Catalog images and consecutive frames from video clips!

  • Fashion-MNIST Model (24x24xGray, Training Set size: 6000 source)

    MNIST Emoji1

  • Emoji Model (64x64xRGBA, Training Set size: 6620 source)

    Emoji1 Emoji1

  • Fashion Model (128x64xRGB, Training Set size: 11379 source)

    Dreses1 Dresses2

  • Rain Drops Model (64x128xGray, 3800 images from own video clip)

    Rain1

What Did Not Work?

Too much diversity in the Training Set.

  • Stamp Model (48x80xRGB, 11491 images from various stamp catalogs). In this example StyleGAN2 learned that a stamp has pips, a small border and typical color schemes, but the motifs were too diverse to be learned:

    Stamps

Machine Learning Development Environment

  • TensorFlow (Google) vs. PyTorch (Facebook); ML abstraction layer Keras

  • Use GPU acceleration!

    • 10x computational power in float32 (standard for ML)
    • BUT slower than a CPU in float64 (scientific applications)
    • 10x faster memory access. Get as much GPU memory as possible!
    • A slower GPU just needs more time, but if the ML network does not fit into GPU memory, you can't use it at all (a quick capability check follows this list).
  • My development environment
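A quick sanity check of the points above, assuming a TensorFlow environment. The timing is rough and illustrative only; exact numbers vary by card:

```python
import time
import tensorflow as tf

# Is a GPU visible to TensorFlow at all?
print("GPUs:", tf.config.list_physical_devices("GPU"))

# Compare float32 (standard for ML) against float64 on the same matrix multiply.
for dtype in (tf.float32, tf.float64):
    a = tf.random.normal([4096, 4096], dtype=dtype)
    tf.matmul(a, a).numpy()                      # warm-up / force execution
    t0 = time.time()
    tf.matmul(a, a).numpy()                      # .numpy() waits for the device
    print(dtype.name, "4096x4096 matmul:", round(time.time() - t0, 3), "s")
```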
