This repo facilitates the encoding of images into the latent space of StyleGAN. It's based on /u/Puzer's StyleGAN Encoder repo with the following changes:
- The encoded latent vector is of shape [1, 512] rather than [18, 512].
- The encoding uses a perceptual loss based on the activations of the StyleGAN discriminator network. In theory, the reconstruction technique should therefore work on StyleGAN regardless of the domain of the training images, and the encoding should focus on the same types of image features that distinguish training images from one another. This may mean that, e.g., encodings of face images are less distracted by background features. (I am not aware of prior uses of perceptual loss from a GAN discriminator network.)
- The Adam optimizer is used with a decayed learning rate.
- Stochastic clipping is employed, in a fashion similar to that described in Precise Recovery of Latent Vectors from Generative Adversarial Networks (Lipton & Tripathi 2017).
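The last two changes can be sketched in plain Python. This is a minimal illustration, not the code used in this repo: the bounds of -1 and 1 follow the uniform prior used by Lipton & Tripathi and are an assumption here, the exponential schedule is only one possible form of decay, and `stochastic_clip`, `decayed_learning_rate`, and all numeric values are hypothetical.

```python
import random


def stochastic_clip(latent, low=-1.0, high=1.0, rng=random):
    """Resample every out-of-range component uniformly from [low, high]."""
    return [x if low <= x <= high else rng.uniform(low, high) for x in latent]


def decayed_learning_rate(initial_rate, decay_rate, step, decay_steps):
    """Exponential decay: initial_rate * decay_rate ** (step / decay_steps)."""
    return initial_rate * decay_rate ** (step / decay_steps)


rng = random.Random(0)            # seeded so the sketch is repeatable
latent = [0.3, -1.7, 2.4, 0.9]    # toy latent after a gradient step
clipped = stochastic_clip(latent, rng=rng)
rate = decayed_learning_rate(0.01, 0.5, step=100, decay_steps=100)
```

Unlike ordinary clipping, which would pin the out-of-range components to the boundary, the resampling step moves them to a random point inside the allowed range.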
Sample images: The left column shows target images and the right column shows reconstructions based on the embedding. The top five target images were outputs of the original StyleGAN network, and the bottom two are real-life photos of celebrities.
Videos of the training of each encoding are available in the generated_videos directory.
Why limit the encoded latent vectors to shape [1, 512] rather than [18, 512]?
- The mapping network of the original StyleGAN outputs [1, 512] latent vectors, so images reconstructed from [1, 512] encodings may better resemble the natural outputs of the StyleGAN network.
- Style mixing proceeds by combining multiple latent vectors into a composite [18, 512] latent vector, which isn't straightforward when individual image encodings are already of shape [18, 512].
- Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? (Abdal, Qin & Wonka 2019) demonstrated that use of the full [18, 512] latent space allows all manner of images to be reproduced by the pretrained StyleGAN network, even images highly dissimilar to training data, perhaps suggesting that the accuracy of the encoded images more reflects the amount of freedom afforded by the expanded latent vector than the domain expertise of the network.
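The relationship between the two shapes can be sketched as follows. This is an illustration in plain Python lists rather than code from this repo; `broadcast_latent`, `style_mix`, and the crossover index are hypothetical names and values. It shows how a single 512-component vector is tiled over the 18 style inputs, and how two such vectors are combined for style mixing.

```python
NUM_LAYERS = 18     # style inputs of the 1024x1024 synthesis network
LATENT_SIZE = 512


def broadcast_latent(w):
    """Tile one [512] vector into an [18, 512] list of per-layer copies."""
    return [list(w) for _ in range(NUM_LAYERS)]


def style_mix(w_coarse, w_fine, crossover):
    """Use w_coarse for layers below `crossover` and w_fine for the rest."""
    return [list(w_coarse if layer < crossover else w_fine)
            for layer in range(NUM_LAYERS)]


w_a = [0.0] * LATENT_SIZE   # toy encoding of image A
w_b = [1.0] * LATENT_SIZE   # toy encoding of image B
mixed = style_mix(w_a, w_b, crossover=8)
```

With [1, 512] encodings, each row of the composite comes from exactly one source image, which is what makes the mixing step simple.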
The code is used in much the same way as /u/Puzer's StyleGAN Encoder repo, except that the command line arguments have been removed and replaced with hardwired directories in the Python code files.
Original readme for /u/Puzer's StyleGAN Encoder repo follows:
These people are real: their latent representations were found using the perceptual loss trick. These representations were then moved along the "smiling direction" and transformed back into images.
Short explanation of encoding approach:
- Original pre-trained StyleGAN generator is used for generating images
- Pre-trained VGG16 network is used for transforming a reference image and generated image into high-level features space
- Loss is calculated as a difference between them in the features space
- Optimization is performed only for latent representation which we want to obtain.
- Upon completion of optimization you are able to transform your latent vector as you wish. For example you can find a "smiling direction" in your latent space, move your latent vector in this direction and transform it back to image using the generator.
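The optimization described above can be illustrated with a toy example. This is a sketch under strong assumptions, not the encoder itself: a fixed linear map stands in for both the frozen generator and the feature network, the gradient is written out by hand, and all names and numbers are hypothetical. The point it shows is that only the latent vector is updated.

```python
def generate(latent):
    """Stand-in for a frozen generator plus feature network (hypothetical)."""
    return [2.0 * latent[0] + latent[1], latent[0] - latent[1]]


def loss(latent, target):
    """Squared difference between generated and reference features."""
    return sum((g - t) ** 2 for g, t in zip(generate(latent), target))


def loss_gradient(latent, target):
    """Analytic gradient of the loss with respect to the latent only."""
    r0, r1 = (g - t for g, t in zip(generate(latent), target))
    return [2.0 * (2.0 * r0 + r1), 2.0 * (r0 - r1)]


target = generate([0.5, -0.25])   # features of the "reference image"
latent = [0.0, 0.0]               # starting point of the optimization
initial_loss = loss(latent, target)
for _ in range(200):
    grad = loss_gradient(latent, target)
    latent = [z - 0.05 * g for z, g in zip(latent, grad)]
final_loss = loss(latent, target)
```

The generator's own parameters never change in the loop; the latent is the only variable being fitted.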
New scripts for finding your own directions will be released soon. For now you can play with the existing ones: smiling, age, gender. You can find more examples in the Jupyter notebook.
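Moving a latent vector along a direction is a simple vector operation. The sketch below uses three-component toy vectors and a made-up direction; `move_along_direction` and `smile_direction` are hypothetical names, and a real direction would have the same shape as the encoded latent.

```python
def move_along_direction(latent, direction, coeff):
    """Return latent + coeff * direction, computed element by element."""
    return [z + coeff * d for z, d in zip(latent, direction)]


latent = [0.2, -0.4, 1.0]
smile_direction = [1.0, 0.0, -0.5]   # hypothetical learned direction

# A negative coefficient moves away from the attribute, a positive one toward it.
shifted = [move_along_direction(latent, smile_direction, coeff)
           for coeff in (-1.0, 0.0, 1.0)]
```

Each shifted vector would then be passed through the generator to obtain the corresponding image.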
You can generate latent representations of your own images using two scripts:
- Extract and align faces from images
python raw_images/ aligned_images/
- Find latent representation of aligned images
python aligned_images/ generated_images/ latent_representations/
- Then you can play with Jupyter notebook
Feel free to join the research. There is still much room for improvement:
- Better model for perceptual loss
- Is it possible to generate latent representations by using another model instead of direct optimization? (WIP)
Stay tuned!
This repository contains the (no longer) official TensorFlow implementation of the following paper:
A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras (NVIDIA), Samuli Laine (NVIDIA), Timo Aila (NVIDIA)
We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
For business inquiries, please contact [email protected]
For press and other inquiries, please contact Hector Marinez at [email protected]
All material related to our paper is available via the following links:
Link | Description |
---|---|
 | Paper PDF. |
 | Result video. |
 | Source code. |
 | Flickr-Faces-HQ dataset. |
 | Google Drive folder. |
Additional material can be found in Google Drive folder:
Path | Description |
---|---|
StyleGAN | Main folder. |
├ stylegan-paper.pdf | High-quality version of the paper PDF. |
├ stylegan-video.mp4 | High-quality version of the result video. |
├ images | Example images produced by our generator. |
│ ├ representative-images | High-quality images to be used in articles, blog posts, etc. |
│ └ 100k-generated-images | 100,000 generated images for different amounts of truncation. |
│   ├ ffhq-1024x1024 | Generated using Flickr-Faces-HQ at 1024×1024. |
│   ├ bedrooms-256x256 | Generated using LSUN Bedroom at 256×256. |
│   ├ cars-512x384 | Generated using LSUN Car at 512×384. |
│   └ cats-256x256 | Generated using LSUN Cat at 256×256. |
├ videos | Example videos produced by our generator. |
│ └ high-quality-video-clips | Individual segments of the result video as high-quality MP4. |
├ ffhq-dataset | Raw data for the Flickr-Faces-HQ dataset. |
└ networks | Pre-trained networks as pickled instances of dnnlib.tflib.Network. |
  ├ stylegan-ffhq-1024x1024.pkl | StyleGAN trained with Flickr-Faces-HQ dataset at 1024×1024. |
  ├ stylegan-celebahq-1024x1024.pkl | StyleGAN trained with CelebA-HQ dataset at 1024×1024. |
  ├ stylegan-bedrooms-256x256.pkl | StyleGAN trained with LSUN Bedroom dataset at 256×256. |
  ├ stylegan-cars-512x384.pkl | StyleGAN trained with LSUN Car dataset at 512×384. |
  ├ stylegan-cats-256x256.pkl | StyleGAN trained with LSUN Cat dataset at 256×256. |
  └ metrics | Auxiliary networks for the quality and disentanglement metrics. |
    ├ inception_v3_features.pkl | Standard Inception-v3 classifier that outputs a raw feature vector. |
    ├ vgg16_zhang_perceptual.pkl | Standard LPIPS metric to estimate perceptual similarity. |
    ├ celebahq-classifier-00-male.pkl | Binary classifier trained to detect a single attribute of CelebA-HQ. |
    └ ⋯ | Please see the file listing for remaining networks. |
All material, excluding the Flickr-Faces-HQ dataset, is made available under Creative Commons BY-NC 4.0 license by NVIDIA Corporation. You can use, redistribute, and adapt the material for non-commercial purposes, as long as you give appropriate credit by citing our paper and indicating any changes that you've made.
For license information regarding the FFHQ dataset, please refer to the Flickr-Faces-HQ repository.
and inception_v3_softmax.pkl are derived from the pre-trained Inception-v3 network by Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. The network was originally shared under Apache 2.0 license on the TensorFlow Models repository.
and vgg16_zhang_perceptual.pkl are derived from the pre-trained VGG-16 network by Karen Simonyan and Andrew Zisserman. The network was originally shared under Creative Commons BY 4.0 license on the Very Deep Convolutional Networks for Large-Scale Visual Recognition project page.
is further derived from the pre-trained LPIPS weights by Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The weights were originally shared under BSD 2-Clause "Simplified" License on the PerceptualSimilarity repository.
- Both Linux and Windows are supported, but we strongly recommend Linux for performance and compatibility reasons.
- 64-bit Python 3.6 installation. We recommend Anaconda3 with numpy 1.14.3 or newer.
- TensorFlow 1.10.0 or newer with GPU support.
- One or more high-end NVIDIA GPUs with at least 11GB of DRAM. We recommend NVIDIA DGX-1 with 8 Tesla V100 GPUs.
- NVIDIA driver 391.35 or newer, CUDA toolkit 9.0 or newer, cuDNN 7.3.1 or newer.
A minimal example of using a pre-trained StyleGAN generator is given in When executed, the script downloads a pre-trained StyleGAN generator from Google Drive and uses it to generate an image:
> python
Downloading .... done
Gs Params OutputShape WeightShape
--- --- --- ---
latents_in - (?, 512) -
images_out - (?, 3, 1024, 1024) -
--- --- --- ---
Total 26219627
> ls results
example.png
A more advanced example is given in The script reproduces the figures from our paper in order to illustrate style mixing, noise inputs, and truncation:
> python
results/figure02-uncurated-ffhq.png
results/figure03-style-mixing.png
results/figure04-noise-detail.png
results/figure05-noise-components.png
results/figure08-truncation-trick.png
results/figure10-uncurated-bedrooms.png
results/figure11-uncurated-cars.png
results/figure12-uncurated-cats.png
The pre-trained networks are stored as standard pickle files on Google Drive:
import pickle
import dnnlib
import dnnlib.tflib as tflib
import config

tflib.init_tf()  # pickle.load() below needs a default tf.Session
# Load pre-trained network.
url = '' # karras2019stylegan-ffhq-1024x1024.pkl
with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
    _G, _D, Gs = pickle.load(f)
    # _G = Instantaneous snapshot of the generator. Mainly useful for resuming a previous training run.
    # _D = Instantaneous snapshot of the discriminator. Mainly useful for resuming a previous training run.
    # Gs = Long-term average of the generator. Yields higher-quality results than the instantaneous snapshot.
The above code downloads the file and unpickles it to yield 3 instances of dnnlib.tflib.Network. To generate images, you will typically want to use Gs; the other two networks are provided for completeness. In order for pickle.load() to work, you will need to have the dnnlib source directory in your PYTHONPATH and a tf.Session set as default. The session can be initialized by calling dnnlib.tflib.init_tf().
There are three ways to use the pre-trained generator:
- Use Gs.run() for immediate-mode operation where the inputs and outputs are numpy arrays:
  # Pick latent vector.
  rnd = np.random.RandomState(5)
  latents = rnd.randn(1, Gs.input_shape[1])
  # Generate image.
  fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
  images = Gs.run(latents, None, truncation_psi=0.7, randomize_noise=True, output_transform=fmt)
  The first argument is a batch of latent vectors of shape [num, 512]. The second argument is reserved for class labels (not used by StyleGAN). The remaining keyword arguments are optional and can be used to further modify the operation (see below). The output is a batch of images, whose format is dictated by the output_transform argument.
- Use Gs.get_output_for() to incorporate the generator as a part of a larger TensorFlow expression:
  latents = tf.random_normal([self.minibatch_per_gpu] + Gs_clone.input_shape[1:])
  images = Gs_clone.get_output_for(latents, None, is_validation=True, randomize_noise=True)
  images = tflib.convert_images_to_uint8(images)
  result_expr.append(inception_clone.get_output_for(images))
The above code is from metrics/ It generates a batch of random images and feeds them directly to the Inception-v3 network without having to convert the data to numpy arrays in between.
- Look up Gs.components.mapping and Gs.components.synthesis to access individual sub-networks of the generator. Similar to Gs, the sub-networks are represented as independent instances of dnnlib.tflib.Network:
  src_latents = np.stack(np.random.RandomState(seed).randn(Gs.input_shape[1]) for seed in src_seeds)
  src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
  src_images = Gs.components.synthesis.run(src_dlatents, randomize_noise=False, **synthesis_kwargs)
The above code is from It first transforms a batch of latent vectors into the intermediate W space using the mapping network and then turns these vectors into a batch of images using the synthesis network. The
dlatents array stores a separate copy of the same w vector for each layer of the synthesis network to facilitate style mixing.
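The effect of an output transform that converts images to uint8 and reorders NCHW to NHWC can be sketched in plain Python. This is an illustration only, assuming that generator outputs lie roughly in the range [-1, 1]; `to_uint8` and `chw_to_hwc` are hypothetical helpers that operate on a single image stored as nested lists, not functions from dnnlib.

```python
def to_uint8(value, drange=(-1.0, 1.0)):
    """Map a float from `drange` to an integer in [0, 255], with clamping."""
    low, high = drange
    scale = 255.0 / (high - low)
    return max(0, min(255, int(value * scale + (0.5 - low * scale))))


def chw_to_hwc(image):
    """Transpose one [C, H, W] nested list into [H, W, C]."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    return [[[image[c][y][x] for c in range(channels)]
             for x in range(width)]
            for y in range(height)]


# Toy image with 3 channels, 1 row, and 2 columns.
chw = [[[-1.0, 0.0]], [[0.0, 1.0]], [[1.0, -1.0]]]
hwc = [[[to_uint8(v) for v in pixel] for pixel in row]
       for row in chw_to_hwc(chw)]
```

After the conversion, each innermost list holds the channel values of one pixel, which is the layout that most image libraries expect.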
The exact details of the generator are defined in training/ (see G_style, G_mapping, and G_synthesis). The following keyword arguments can be specified to modify the behavior when calling run() and get_output_for():
- truncation_psi and truncation_cutoff control the truncation trick that is performed by default when using Gs (ψ=0.7, cutoff=8). It can be disabled by setting truncation_psi=1, and the image quality can be further improved at the cost of variation by setting e.g. truncation_psi=0.5. Note that truncation is always disabled when using the sub-networks directly. The average w needed to manually perform the truncation trick can be looked up using Gs.get_var('dlatent_avg').
- randomize_noise determines whether to re-randomize the noise inputs for each generated image (True, default) or whether to use specific noise values for the entire minibatch (False). The specific values can be accessed via the tf.Variable instances that are found using [var for name, var in Gs.components.synthesis.vars.items() if name.startswith('noise')].
- When using the mapping network directly, you can specify dlatent_broadcast=None to disable the automatic duplication of dlatents over the layers of the synthesis network.
- Runtime performance can be fine-tuned via structure='fixed' and dtype='float16'. The former disables support for progressive growing, which is not needed for a fully-trained generator, and the latter performs all computation using half-precision floating point arithmetic.
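Performing the truncation trick manually amounts to interpolating each w vector toward the average w. The following is a minimal sketch in plain Python with two-component toy vectors; `truncate` is a hypothetical helper, and it assumes that the cutoff counts layers from the start of the synthesis network.

```python
def truncate(dlatents, dlatent_avg, psi=0.7, cutoff=8):
    """Pull the first `cutoff` layers toward the average w by factor psi."""
    return [
        [avg + psi * (w - avg) for w, avg in zip(layer, dlatent_avg)]
        if index < cutoff else list(layer)
        for index, layer in enumerate(dlatents)
    ]


dlatent_avg = [0.0, 1.0]                     # toy stand-in for the average w
dlatents = [[2.0, 3.0] for _ in range(18)]   # one copy of w per layer
truncated = truncate(dlatents, dlatent_avg, psi=0.5, cutoff=8)
untouched = truncate(dlatents, dlatent_avg, psi=1.0)
```

With psi=1 every layer is returned unchanged, which matches the statement above that this value disables truncation.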
The training and evaluation scripts operate on datasets stored as multi-resolution TFRecords. Each dataset is represented by a directory containing the same image data in several resolutions to enable efficient streaming. There is a separate *.tfrecords file for each resolution, and if the dataset contains labels, they are stored in a separate file as well. By default, the scripts expect to find the datasets at datasets/<NAME>/<NAME>-<RESOLUTION>.tfrecords. The directory can be changed by editing
result_dir = 'results'
data_dir = 'datasets'
cache_dir = 'cache'
To obtain the FFHQ dataset (datasets/ffhq), please refer to the Flickr-Faces-HQ repository.
To obtain the CelebA-HQ dataset (datasets/celebahq), please refer to the Progressive GAN repository.
To obtain other datasets, including LSUN, please consult their corresponding project pages. The datasets can be converted to multi-resolution TFRecords using the provided
> python create_lsun datasets/lsun-bedroom-full ~/lsun/bedroom_lmdb --resolution 256
> python create_lsun_wide datasets/lsun-car-512x384 ~/lsun/car_lmdb --width 512 --height 384
> python create_lsun datasets/lsun-cat-full ~/lsun/cat_lmdb --resolution 256
> python create_cifar10 datasets/cifar10 ~/cifar10
> python create_from_images datasets/custom-dataset ~/custom-images
Once the datasets are set up, you can train your own StyleGAN networks as follows:
- Edit to specify the dataset and training configuration by uncommenting or editing specific lines.
- Run the training script with
- The results are written to a newly created directory
- The training may take several days (or weeks) to complete, depending on the configuration.
By default,
is configured to train the highest-quality StyleGAN (configuration F in Table 1) for the FFHQ dataset at 1024×1024 resolution using 8 GPUs. Please note that we have used 8 GPUs in all of our experiments. Training with fewer GPUs may not produce identical results – if you wish to compare against our technique, we strongly recommend using the same number of GPUs.
Expected training time for 1024×1024 resolution using Tesla V100 GPUs:
GPUs | Training time |
---|---|
1 | 5 weeks |
2 | 3 weeks |
4 | 2 weeks |
8 | 1 week |
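The table can be turned into a rough estimate of scaling efficiency. The sketch below only does arithmetic on the listed values, which are rounded to whole weeks, so the resulting figures are approximate; `scaling_summary` is a hypothetical helper.

```python
def scaling_summary(weeks_by_gpus):
    """Return {gpus: (gpu_weeks, speedup, efficiency)} relative to one GPU."""
    baseline = weeks_by_gpus[1]
    summary = {}
    for gpus, weeks in weeks_by_gpus.items():
        speedup = baseline / weeks
        summary[gpus] = (gpus * weeks, speedup, speedup / gpus)
    return summary


training_weeks = {1: 5, 2: 3, 4: 2, 8: 1}   # GPUs -> weeks, copied from the table
summary = scaling_summary(training_weeks)
for gpus in sorted(summary):
    gpu_weeks, speedup, efficiency = summary[gpus]
    print('{} GPU(s): {} GPU-weeks, {:.2f}x speedup, {:.0%} efficiency'.format(
        gpus, gpu_weeks, speedup, efficiency))
```

By this estimate, adding GPUs shortens the wall-clock time but increases the total number of GPU-weeks consumed.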
The quality and disentanglement metrics used in our paper can be evaluated using By default, the script will evaluate the Fréchet Inception Distance (fid50k) for the pre-trained FFHQ generator and write the results into a newly created directory under results. The exact behavior can be changed by uncommenting or editing specific lines in
Expected evaluation time and results for the pre-trained FFHQ generator using one Tesla V100 GPU:
Metric | Time | Result | Description |
---|---|---|---|
fid50k | 16 min | 4.4159 | Fréchet Inception Distance using 50,000 images. |
ppl_zfull | 55 min | 664.8854 | Perceptual Path Length for full paths in Z. |
ppl_wfull | 55 min | 233.3059 | Perceptual Path Length for full paths in W. |
ppl_zend | 55 min | 666.1057 | Perceptual Path Length for path endpoints in Z. |
ppl_wend | 55 min | 197.2266 | Perceptual Path Length for path endpoints in W. |
ls | 10 hours | z: 165.0106, w: 3.7447 | Linear Separability in Z and W. |
Please note that the exact results may vary from run to run due to the non-deterministic nature of TensorFlow.
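For intuition about what fid50k measures: the Fréchet distance compares two Gaussians fitted to feature vectors of real and generated images. The general formula uses full covariance matrices and a matrix square root. The sketch below is a deliberate simplification that assumes diagonal covariances, in which case the distance reduces to squared differences of means and standard deviations. It is not the metric implementation shipped with this code, and `frechet_distance_diagonal` is a hypothetical name.

```python
def frechet_distance_diagonal(mean_a, std_a, mean_b, std_b):
    """Fréchet distance between two Gaussians with diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    cov_term = sum((a - b) ** 2 for a, b in zip(std_a, std_b))
    return mean_term + cov_term


# Toy two-dimensional feature statistics for "real" and "generated" images.
same = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
wider = frechet_distance_diagonal([0.0], [1.0], [0.0], [3.0])
```

Identical statistics give a distance of zero, and the value grows as either the means or the spreads of the two feature distributions drift apart, which is why lower FID values indicate generated images that are statistically closer to the real ones.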
We thank Jaakko Lehtinen, David Luebke, and Tuomas Kynkäänniemi for in-depth discussions and helpful comments; Janne Hellsten, Tero Kuosmanen, and Pekka Jänis for compute infrastructure and help with the code release.