Official PyTorch implementation of our CVPR 2023 highlight paper "Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization".
TL;DR: For vector-quantization (VQ) based autoregressive image generation, we propose a novel variable-length coding to replace the existing fixed-length coding, which yields an accurate and compact code representation for images and a natural coarse-to-fine autoregressive generation order.
Our framework includes: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities; (2) DQ-Transformer, which generates images autoregressively from coarse-grained (smooth regions with fewer codes) to fine-grained (detailed regions with more codes) by modeling the position and content of codes in each granularity alternately, through a novel stacked-transformer architecture and a shared-content, non-shared-position input layer design.
See also our other CVPR 2023 work on vector-quantization based image generation: "Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation" (GitHub).
Please run the following command to install the necessary dependencies.
conda env create -f environment.yml
Prepare the datasets as follows, then change the corresponding data path in `data/default.py`.
Prepare ImageNet dataset structure as follows:
${Your Data Root Path}/ImageNet/
├── train
│   ├── n01440764
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   ├── ...
│   ├── n01443537
│   │   ├── n01443537_2.JPEG
│   │   ├── n01443537_16.JPEG
│   │   ├── ...
│   ├── ...
├── val
│   ├── n01440764
│   │   ├── ILSVRC2012_val_00000293.JPEG
│   │   ├── ILSVRC2012_val_00002138.JPEG
│   │   ├── ...
│   ├── n01443537
│   │   ├── ILSVRC2012_val_00000236.JPEG
│   │   ├── ILSVRC2012_val_00000262.JPEG
│   │   ├── ...
│   ├── ...
├── imagenet_idx_to_synset.yml
├── synset_human.txt
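Before launching a long training run, it can help to confirm that the folder really matches the layout above. The following is a minimal sketch using only the Python standard library; the helper name `check_imagenet_layout` is hypothetical and is not part of this repository.

```python
from pathlib import Path


def check_imagenet_layout(root):
    """Return a list of problems found under <root>/ImageNet (empty if none).

    Hypothetical helper: it only checks the folder and file names shown in
    the tree above, not the image contents.
    """
    base = Path(root) / "ImageNet"
    problems = []
    for split in ("train", "val"):
        split_dir = base / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
            continue
        # Each split is expected to contain one folder per synset.
        synsets = [d for d in split_dir.iterdir() if d.is_dir()]
        if not synsets:
            problems.append(f"no synset folders in: {split_dir}")
        for synset in synsets:
            if not any(synset.glob("*.JPEG")):
                problems.append(f"no .JPEG files in: {synset}")
    for name in ("imagenet_idx_to_synset.yml", "synset_human.txt"):
        if not (base / name).is_file():
            problems.append(f"missing file: {base / name}")
    return problems
```

Calling `check_imagenet_layout("/path/to/your/data/root")` returns an empty list when every expected folder and file is present.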
The FFHQ dataset can be obtained from the FFHQ repository. Then prepare the dataset structure as follows:
${Your Data Root Path}/FFHQ/
├── assets
│   ├── ffhqtrain.txt
│   ├── ffhqvalidation.txt
├── FFHQ
│   ├── 00000.png
│   ├── 00001.png
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --gpus -1 --base configs/stage1/dqvae-dual-r-05_imagenet.yml --max_epochs 50
The target ratio for the finer granularity (F=8) can be set in `model.params.lossconfig.params.budget_loss_config.params.target_ratio`.
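The dotted name above is a path through the nested config. As an illustration only, the sketch below sets a value at such a path in a plain Python dictionary; the helper `set_by_dotted_path` and the numeric values are hypothetical, and the repository's own config loader may expose a different mechanism.

```python
def set_by_dotted_path(config, dotted_key, value):
    """Set config[a][b]...[z] = value for a dotted key "a.b...z".

    Hypothetical helper operating on plain nested dicts; intermediate
    levels are created if they do not exist yet.
    """
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Placeholder config mirroring only the key path named in the text.
config = {
    "model": {"params": {"lossconfig": {"params": {
        "budget_loss_config": {"params": {"target_ratio": 0.5}}
    }}}}
}
set_by_dotted_path(
    config,
    "model.params.lossconfig.params.budget_loss_config.params.target_ratio",
    0.3,  # illustrative value, not a recommended setting
)
```

In practice the same change can simply be made by editing the YAML file passed to `--base`.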
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --gpus -1 --base configs/stage1/dqvae-triple-r-03-03_imagenet.yml --max_epochs 50
The target ratio for the finest granularity (F=8) can be set in `model.params.lossconfig.params.budget_loss_config.params.target_fine_ratio`. The target ratio for the median granularity (F=16) can be set in `model.params.lossconfig.params.budget_loss_config.params.target_median_ratio`.
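To get a feel for how the target ratios affect the code length, the sketch below estimates the expected number of codes per image. It rests on several assumptions that are not stated in this README: a 256×256 input, a ratio interpreted as the fraction of coarsest-grid regions assigned to a granularity, and, for the triple setting, a coarsest granularity of F=32. The function `expected_code_count` is hypothetical.

```python
def expected_code_count(image_size, factors, ratios):
    """Expected number of codes per image under the stated assumptions.

    factors: downsampling factors ordered from coarsest to finest.
    ratios:  assumed fraction of coarsest-grid regions per granularity
             (same order as `factors`, summing to 1).
    """
    if len(factors) != len(ratios):
        raise ValueError("factors and ratios must have the same length")
    if abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("ratios must sum to 1")
    coarsest = factors[0]
    num_regions = (image_size // coarsest) ** 2
    # A region refined to factor f is split into (coarsest / f)^2 codes.
    codes_per_region = [(coarsest // f) ** 2 for f in factors]
    return num_regions * sum(r * c for r, c in zip(ratios, codes_per_region))


# Dual granularity (F=16, F=8) with half of the regions refined:
dual = expected_code_count(256, [16, 8], [0.5, 0.5])  # 256 * (0.5*1 + 0.5*4) = 640.0
# Triple granularity, assuming F=32 as the coarsest level:
triple = expected_code_count(256, [32, 16, 8], [0.4, 0.3, 0.3])  # roughly 409.6
```

Under these assumptions, both settings stay well below the 1024 codes of a fixed-length F=8 grid.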
Here we provide an improved version of DQ-VAE compared with the one proposed in the paper, which leads to much more stable training and slightly better reconstruction quality. Specifically, we assign the granularity of each region directly according to its image entropy instead of the features extracted by the encoder.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --gpus -1 --base configs/stage1/dqvae-entropy-dual-r05_imagenet.yml --max_epochs 50
The target ratio for the finer granularity (F=8) can be set in `model.params.encoderconfig.params.router_config.params.fine_grain_ratito`. The distribution of image entropy is pre-calculated in `scripts/tools/thresholds/entropy_thresholds_imagenet_train_patch-16.json`.
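The idea of entropy-based routing can be sketched as follows. This is an illustration under assumptions, not the code in this repository: it uses the Shannon entropy of a patch's intensity histogram, derives a threshold from a list of reference entropies, and marks patches at or above the threshold as fine. The function names and the threshold rule are hypothetical, and the format of the pre-calculated JSON file is not reproduced here.

```python
import math
from collections import Counter


def patch_entropy(patch):
    """Shannon entropy (in bits) of the intensity histogram of a 2-D patch."""
    pixels = [value for row in patch for value in row]
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def threshold_for_ratio(reference_entropies, fine_ratio):
    """Entropy value at or above which roughly `fine_ratio` of references fall."""
    ordered = sorted(reference_entropies)
    index = min(len(ordered) - 1, int(len(ordered) * (1.0 - fine_ratio)))
    return ordered[index]


def assign_granularity(entropy, threshold):
    """Route a region to the fine granularity if its entropy is high enough."""
    return "fine" if entropy >= threshold else "coarse"


flat_patch = [[7, 7], [7, 7]]          # a smooth region
edge_patch = [[0, 255], [0, 255]]      # a region with a sharp edge
threshold = threshold_for_ratio([0.0, 1.0, 2.0, 3.0], fine_ratio=0.5)
labels = [
    assign_granularity(patch_entropy(p), threshold)
    for p in (flat_patch, edge_patch)
]
```

A design consequence of thresholding against a pre-computed distribution is that the fine ratio holds on average over the dataset rather than exactly for every image.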
Thanks to @1e0nhardt for the advice; we have updated the image entropy calculation in `models/stage1_dynamic/dqvae_dual_entropy.py`.
Copy the config of the first-stage model (DQ-VAE) to `model.params.first_stage_config`. The path of the pre-trained DQ-VAE should be set in `model.params.first_stage_config.params.ckpt_path`. Here we take ImageNet as an example to show unconditional DQ-Transformer training; other datasets such as FFHQ can be handled analogously.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --gpus -1 --base configs/stage2/uncond_imagenet_p6c18.yml --max_epochs 100
NOTE: some important hyper-parameters in the config file:
- the number of layers of the Content-Transformer: `model.params.transformer_config.params.content_layer`
- the number of layers of the Position-Transformer: `model.params.transformer_config.params.position_layer`
- the vocab size of the content code: `model.params.transformer_config.params.vocab_size`, which should include the codebook size of DQ-VAE, 1 extra pad code, 1 extra eos code and 1 extra sos code.
- the vocab size of the coarse granularity's position: `model.params.transformer_config.params.coarse_position_size`, which should include the size of the coarse granularity's feature map (e.g., $16 \times 16 = 256$ for downsampling factor F=16, or $32 \times 32 = 1024$ for downsampling factor F=8), 1 extra pad code, 1 extra eos code and 1 extra sos code.
- the vocab size of the fine granularity's position: `model.params.transformer_config.params.fine_position_size`
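The arithmetic behind the two documented vocab sizes can be written out as a small sketch. The function name and the codebook size of 1024 used in the example are hypothetical; substitute the codebook size of your own DQ-VAE. `fine_position_size` is left out because its composition is not described above.

```python
def unconditional_vocab_sizes(codebook_size, image_size=256, coarse_factor=16):
    """Vocab sizes for unconditional training, following the notes above.

    Hypothetical helper: every size adds 3 special codes (pad, eos, sos).
    """
    num_special = 3  # pad + eos + sos
    coarse_positions = (image_size // coarse_factor) ** 2  # e.g. 16 * 16 = 256
    return {
        "vocab_size": codebook_size + num_special,
        "coarse_position_size": coarse_positions + num_special,
    }


# Example with an assumed codebook of 1024 entries.
sizes = unconditional_vocab_sizes(codebook_size=1024)
# sizes == {"vocab_size": 1027, "coarse_position_size": 259}
```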
Copy the config of the first-stage model (DQ-VAE) to `model.params.first_stage_config`. The path of the pre-trained DQ-VAE should be set in `model.params.first_stage_config.params.ckpt_path`. Class-conditional training of DQ-Transformer:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --gpus -1 --base configs/stage2/class_imagenet_p6c18.yml --max_epochs 100
NOTE: some important hyper-parameters in the config file:
- the vocab size of the content code: `model.params.transformer_config.params.vocab_size`, which should include the codebook size of DQ-VAE, 1 extra pad code, 1 extra eos code and the 1000 ImageNet class codes.
- the vocab size of the coarse granularity's position: `model.params.transformer_config.params.coarse_position_size`, which should include the size of the coarse granularity's feature map (e.g., $16 \times 16 = 256$ for downsampling factor F=16, or $32 \times 32 = 1024$ for downsampling factor F=8), 1 extra pad code, 1 extra eos code and the 1000 ImageNet class codes.
- the vocab size of the fine granularity's position: `model.params.transformer_config.params.fine_position_size`
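For the class-conditional case the class codes take the place of the sos code, so the totals change. A minimal sketch of that arithmetic follows; the function name and the codebook size of 1024 are hypothetical, and `fine_position_size` is omitted because its composition is not described above.

```python
def class_conditional_vocab_sizes(codebook_size, num_classes=1000,
                                  image_size=256, coarse_factor=16):
    """Vocab sizes for class-conditional training, following the notes above.

    Hypothetical helper: every size adds pad and eos codes plus one code
    per class (1000 for ImageNet).
    """
    num_special = 2  # pad + eos
    coarse_positions = (image_size // coarse_factor) ** 2
    return {
        "vocab_size": codebook_size + num_special + num_classes,
        "coarse_position_size": coarse_positions + num_special + num_classes,
    }


# Example with an assumed codebook of 1024 entries.
class_sizes = class_conditional_vocab_sizes(codebook_size=1024)
# class_sizes == {"vocab_size": 2026, "coarse_position_size": 1258}
```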
description | Training Details | Dataset | FID (val, 50k) | download link |
---|---|---|---|---|
DQ-VAE, dual granularity ( | 4 A100, 10 epochs | ImageNet | 1.6968 | Google Cloud |
If you find this code useful, please cite the following papers:
@InProceedings{Huang_2023_CVPR,
author = {Huang, Mengqi and Mao, Zhendong and Chen, Zhuowei and Zhang, Yongdong},
title = {Towards Accurate Image Coding: Improved Autoregressive Image Generation With Dynamic Vector Quantization},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {22596-22605}
}
@InProceedings{Huang_2023_CVPR_MaskedVQ,
author = {Huang, Mengqi and Mao, Zhendong and Wang, Quan and Zhang, Yongdong},
title = {Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {2002-2011}
}