forked from CompVis/stable-diffusion
Commit e66308c by ablattmann, committed Dec 21, 2021 (1 parent: 182dd36).
Showing 87 changed files with 12,794 additions and 3 deletions.
@@ -1,4 +1,104 @@

# Latent Diffusion Models
High-Resolution Image Synthesis with Latent Diffusion Models

## Requirements
A suitable [conda](https://conda.io/) environment named `ldm` can be created
and activated with:

```
conda env create -f environment.yaml
conda activate ldm
```

# Model Zoo

## Pretrained Autoencoding Models
![rec2](assets/reconstruction2.png)

| Model | FID vs val | PSNR | PSIM | Link | Comments |
|-------------------------|------------|----------------|---------------|---------------------------------------------------------------|--------------|
| f=4, VQ (Z=8192, d=3)   | 0.58 | 27.43 +/- 4.26 | 0.53 +/- 0.21 | https://ommer-lab.com/files/latent-diffusion/vq-f4.zip | |
| f=4, VQ (Z=8192, d=3)   | 1.06 | 25.21 +/- 4.17 | 0.72 +/- 0.26 | https://heibox.uni-heidelberg.de/f/9c6681f64bb94338a069/?dl=1 | no attention |
| f=8, VQ (Z=16384, d=4)  | 1.14 | 23.07 +/- 3.99 | 1.17 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip | |
| f=8, VQ (Z=256, d=4)    | 1.49 | 22.35 +/- 3.81 | 1.26 +/- 0.37 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip | |
| f=16, VQ (Z=16384, d=8) | 5.15 | 20.83 +/- 3.61 | 1.73 +/- 0.43 | https://heibox.uni-heidelberg.de/f/0e42b04e2e904890a9b6/?dl=1 | |
| f=4, KL                 | 0.27 | 27.53 +/- 4.54 | 0.55 +/- 0.24 | https://ommer-lab.com/files/latent-diffusion/kl-f4.zip | |
| f=8, KL                 | 0.90 | 24.19 +/- 4.19 | 1.02 +/- 0.35 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | |
| f=16, KL (d=16)         | 0.87 | 24.08 +/- 4.22 | 1.07 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f16.zip | |
| f=32, KL (d=64)         | 2.04 | 22.27 +/- 3.93 | 1.41 +/- 0.40 | https://ommer-lab.com/files/latent-diffusion/kl-f32.zip | |
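
The `f` and `d` columns determine the latent-space geometry: an autoencoder with downsampling factor `f` maps an `H x W` image to an `H/f x W/f` latent with `d` channels. A minimal sketch (the helper name is hypothetical, not part of the repo):

```python
def latent_shape(f, d, height=256, width=256):
    """Latent shape (channels, h, w) produced by a first-stage model
    with downsampling factor f and d latent channels, derived from
    the f/d columns of the table above."""
    assert height % f == 0 and width % f == 0, "image size must be divisible by f"
    return (d, height // f, width // f)

# f=4, d=3 autoencoder: a 256x256 RGB image maps to a 3x64x64 latent
print(latent_shape(4, 3))    # (3, 64, 64)
# f=32, d=64 autoencoder: much stronger compression, 64x8x8
print(latent_shape(32, 64))  # (64, 8, 8)
```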

### Get the models

Running the following script downloads and extracts all available pretrained autoencoding models.

```shell script
bash scripts/download_first_stages.sh
```

The first stage models can then be found in `models/first_stage_models/<model_spec>`.
## Pretrained LDMs

| Dataset | Task | Model | FID | IS | Prec | Recall | Link | Comments |
|---------|------|-------|-----|----|------|--------|------|----------|
| CelebA-HQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 5.11 (5.11) | 3.29 | 0.72 | 0.49 | https://ommer-lab.com/files/latent-diffusion/celeba.zip | |
| FFHQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 4.98 (4.98) | 4.50 (4.50) | 0.73 | 0.50 | https://ommer-lab.com/files/latent-diffusion/ffhq.zip | |
| LSUN-Churches | Unconditional Image Synthesis | LDM-KL-8 (400 DDIM steps, eta=0) | 4.02 (4.02) | 2.72 | 0.64 | 0.52 | https://ommer-lab.com/files/latent-diffusion/lsun_churches.zip | |
| LSUN-Bedrooms | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 2.95 (3.0) | 2.22 (2.23) | 0.66 | 0.48 | https://ommer-lab.com/files/latent-diffusion/lsun_bedrooms.zip | |
| ImageNet | Class-conditional Image Synthesis | LDM-VQ-8 (200 DDIM steps, eta=1) | 7.77 (7.76)* / 15.82** | 201.56 (209.52)* / 78.82** | 0.84* / 0.65** | 0.35* / 0.63** | https://ommer-lab.com/files/latent-diffusion/cin.zip | *: w/ guiding, classifier_scale 10; **: w/o guiding; scores in brackets calculated with the script provided by [ADM](https://github.com/openai/guided-diffusion) |
| Conceptual Captions | Text-conditional Image Synthesis | LDM-VQ-f4 (100 DDIM steps, eta=0) | 16.79 | 13.89 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/text2img.zip | finetuned from LAION |
| OpenImages | Super-resolution | N/A | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip | BSR image degradation |
| OpenImages | Layout-to-Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 32.02 | 15.92 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/layout2img_model.zip | |
| Landscapes (finetuned 512) | Semantic Image Synthesis | LDM-VQ-4 (100 DDIM steps, eta=1) | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis.zip | |

### Get the models

The LDMs listed above can jointly be downloaded and extracted via

```shell script
bash scripts/download_models.sh
```

The models can then be found in `models/ldm/<model_spec>`.
### Sampling with unconditional models

We provide a first script for sampling from our unconditional models. Start it via

```shell script
CUDA_VISIBLE_DEVICES=<GPU_ID> python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l <logdir> -n <#samples> --batch_size <batch_size> -c <#ddim_steps> -e <eta>
```

# Inpainting
![inpainting](assets/inpainting.png)

Download the pre-trained weights
```
wget XXX
```

and sample with
```
python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results
```
`indir` should contain images `*.png` and masks `<image_fname>_mask.png` like
the examples provided in `data/inpainting_examples`.
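
The role of the mask files can be illustrated with a toy compositing rule. This is a scalar sketch only; `scripts/inpaint.py` operates on image tensors, and the convention assumed here (mask == 1 marks pixels to fill) should be checked against the examples in `data/inpainting_examples`:

```python
def composite(original, generated, mask):
    # mask == 1 -> take the generated value, mask == 0 -> keep the original.
    # A toy, per-pixel view of how a mask selects the inpainted region.
    return mask * generated + (1.0 - mask) * original

print(composite(0.5, 1.0, 0.0))  # unmasked pixel: stays 0.5
print(composite(0.5, 1.0, 1.0))  # masked pixel: becomes 1.0
```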

## Coming Soon...

* Code for training LDMs and the corresponding compression models.
* Inference scripts for conditional LDMs for various conditioning modalities.
* In the meantime, you can play with our colab notebook https://colab.research.google.com/drive/1xqzUi2iXQXDqXBHQGP9Mqt2YrYW6cx-J?usp=sharing
* We will also release some further pretrained models.

## Comments

- Our codebase for the diffusion models builds heavily on [OpenAI's codebase](https://github.com/openai/guided-diffusion)
and [https://github.com/lucidrains/denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch).
Thanks for open-sourcing!

- The implementation of the transformer encoder is from [x-transformers](https://github.com/lucidrains/x-transformers) by [lucidrains](https://github.com/lucidrains?tab=repositories).

...coming soon™
@@ -0,0 +1,54 @@

model:
  base_learning_rate: 4.5e-6
  target: ldm.models.autoencoder.AutoencoderKL
  params:
    monitor: "val/rec_loss"
    embed_dim: 16
    lossconfig:
      target: ldm.modules.losses.LPIPSWithDiscriminator
      params:
        disc_start: 50001
        kl_weight: 0.000001
        disc_weight: 0.5

    ddconfig:
      double_z: True
      z_channels: 16
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [ 1,1,2,2,4 ]  # num_down = len(ch_mult)-1
      num_res_blocks: 2
      attn_resolutions: [ 16 ]
      dropout: 0.0

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 12
    wrap: True
    train:
      target: ldm.data.imagenet.ImageNetSRTrain
      params:
        size: 256
        degradation: pil_nearest
    validation:
      target: ldm.data.imagenet.ImageNetSRValidation
      params:
        size: 256
        degradation: pil_nearest

lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 1000
        max_images: 8
        increase_log_steps: True

  trainer:
    benchmark: True
    accumulate_grad_batches: 2
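
As the `num_down = len(ch_mult)-1` comment indicates, the downsampling factor follows directly from the channel-multiplier list: each entry after the first adds one 2x downsampling stage. A small sketch (the helper name is hypothetical):

```python
def downsampling_factor(ch_mult):
    # Each entry after the first adds one 2x downsampling stage,
    # so f = 2 ** (len(ch_mult) - 1), per the num_down comment above.
    return 2 ** (len(ch_mult) - 1)

# ch_mult from this config: [1, 1, 2, 2, 4] -> f = 16,
# i.e. a 256x256 input yields a 16x16 spatial latent.
f = downsampling_factor([1, 1, 2, 2, 4])
print(f, 256 // f)  # 16 16
```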
@@ -0,0 +1,53 @@

model:
  base_learning_rate: 4.5e-6
  target: ldm.models.autoencoder.AutoencoderKL
  params:
    monitor: "val/rec_loss"
    embed_dim: 4
    lossconfig:
      target: ldm.modules.losses.LPIPSWithDiscriminator
      params:
        disc_start: 50001
        kl_weight: 0.000001
        disc_weight: 0.5

    ddconfig:
      double_z: True
      z_channels: 4
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [ 1,2,4,4 ]  # num_down = len(ch_mult)-1
      num_res_blocks: 2
      attn_resolutions: [ ]
      dropout: 0.0

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 12
    wrap: True
    train:
      target: ldm.data.imagenet.ImageNetSRTrain
      params:
        size: 256
        degradation: pil_nearest
    validation:
      target: ldm.data.imagenet.ImageNetSRValidation
      params:
        size: 256
        degradation: pil_nearest

lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 1000
        max_images: 8
        increase_log_steps: True

  trainer:
    benchmark: True
    accumulate_grad_batches: 2
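
Note that `base_learning_rate` is not the effective learning rate: the training entry point in codebases of this family typically scales it by batch size, GPU count, and gradient accumulation. The scaling rule and helper below are assumptions to be checked against `main.py`:

```python
def effective_lr(base_lr, batch_size, n_gpus=1, accumulate_grad_batches=1):
    # Hypothetical mirror of the usual scaling rule:
    # lr = accumulate_grad_batches * n_gpus * batch_size * base_lr
    return accumulate_grad_batches * n_gpus * batch_size * base_lr

# Values from this config: base_learning_rate 4.5e-6, batch_size 12,
# accumulate_grad_batches 2 (single GPU assumed) -> roughly 1.08e-4
print(effective_lr(4.5e-6, 12, 1, 2))
```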
@@ -0,0 +1,54 @@

model:
  base_learning_rate: 4.5e-6
  target: ldm.models.autoencoder.AutoencoderKL
  params:
    monitor: "val/rec_loss"
    embed_dim: 3
    lossconfig:
      target: ldm.modules.losses.LPIPSWithDiscriminator
      params:
        disc_start: 50001
        kl_weight: 0.000001
        disc_weight: 0.5

    ddconfig:
      double_z: True
      z_channels: 3
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [ 1,2,4 ]  # num_down = len(ch_mult)-1
      num_res_blocks: 2
      attn_resolutions: [ ]
      dropout: 0.0

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 12
    wrap: True
    train:
      target: ldm.data.imagenet.ImageNetSRTrain
      params:
        size: 256
        degradation: pil_nearest
    validation:
      target: ldm.data.imagenet.ImageNetSRValidation
      params:
        size: 256
        degradation: pil_nearest

lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 1000
        max_images: 8
        increase_log_steps: True

  trainer:
    benchmark: True
    accumulate_grad_batches: 2
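
The `disc_start: 50001` setting delays the adversarial term of the loss: for the first 50k-odd global steps the autoencoder trains on reconstruction and KL losses only. A sketch of that gating (an assumed simplification of `LPIPSWithDiscriminator`; see `ldm/modules/losses` for the exact rule):

```python
def disc_factor(global_step, disc_start, weight=1.0):
    # The adversarial loss weight is zero until `disc_start` steps,
    # then switches to `weight` (here a hard on/off simplification).
    return weight if global_step >= disc_start else 0.0

print(disc_factor(10_000, 50001))  # 0.0 -> pure reconstruction + KL phase
print(disc_factor(60_000, 50001))  # 1.0 -> adversarial loss active
```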
@@ -0,0 +1,53 @@

model:
  base_learning_rate: 4.5e-6
  target: ldm.models.autoencoder.AutoencoderKL
  params:
    monitor: "val/rec_loss"
    embed_dim: 64
    lossconfig:
      target: ldm.modules.losses.LPIPSWithDiscriminator
      params:
        disc_start: 50001
        kl_weight: 0.000001
        disc_weight: 0.5

    ddconfig:
      double_z: True
      z_channels: 64
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [ 1,1,2,2,4,4 ]  # num_down = len(ch_mult)-1
      num_res_blocks: 2
      attn_resolutions: [ 16, 8 ]
      dropout: 0.0

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 12
    wrap: True
    train:
      target: ldm.data.imagenet.ImageNetSRTrain
      params:
        size: 256
        degradation: pil_nearest
    validation:
      target: ldm.data.imagenet.ImageNetSRValidation
      params:
        size: 256
        degradation: pil_nearest

lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 1000
        max_images: 8
        increase_log_steps: True

  trainer:
    benchmark: True
    accumulate_grad_batches: 2
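
`attn_resolutions` names the feature-map sizes at which attention blocks are inserted: the resolution halves after each downsampling stage, and a stage gets attention when its size appears in the list. A sketch of that rule (assumed from the usual ddconfig behaviour; check `ldm/modules/diffusionmodules` for the real construction):

```python
def attention_levels(resolution, ch_mult, attn_resolutions):
    # Resolution halves after each of the len(ch_mult)-1 downsampling
    # stages; a level gets attention when its feature-map size appears
    # in attn_resolutions. Returns (level_index, feature_map_size) pairs.
    levels, res = [], resolution
    for level in range(len(ch_mult)):
        if res in attn_resolutions:
            levels.append((level, res))
        if level < len(ch_mult) - 1:
            res //= 2
    return levels

# This config: resolution 256, ch_mult [1,1,2,2,4,4], attention at [16, 8]
print(attention_levels(256, [1, 1, 2, 2, 4, 4], [16, 8]))  # [(4, 16), (5, 8)]
```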