Skip to content

Commit

Permalink
Adds a blog post discussing simple optimization techniques for SDXL i…
Browse files Browse the repository at this point in the history
…n 🧨 diffusers (#1603)

* blog

* minor edits.

* add: thumbnail

* metadata

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <[email protected]>

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <[email protected]>

* clarify it's the unet that i larger.

* make the important blocks pop out.

* change to Model memory footprint

* reword slightly./

* make some entries in the table bold.

* put madebyollin within backquotes.

* add a sentence clarifying inference latency computations.

* punctuation

---------

Co-authored-by: Pedro Cuenca <[email protected]>
  • Loading branch information
sayakpaul and pcuenca authored Oct 24, 2023
1 parent 6637550 commit 463c064
Show file tree
Hide file tree
Showing 3 changed files with 251 additions and 0 deletions.
10 changes: 10 additions & 0 deletions _blog.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2980,6 +2980,16 @@
- open-source
- serverless

- local: simple_sdxl_optimizations
title: "Exploring simple optimizations for SDXL"
author: sayakpaul
thumbnail: /blog/assets/simple_sdxl_optimizations/thumbnail.png
date: October 24, 2023
tags:
- diffusers
- guide
- sdxl

- local: the_n_implementation_details_of_rlhf_with_ppo
title: "The N Implementation Details of RLHF with PPO"
author: vwxyzjn
Expand Down
Binary file added assets/simple_sdxl_optimizations/thumbnail.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
241 changes: 241 additions & 0 deletions simple_sdxl_optimizations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,241 @@
---
title: "Exploring simple optimizations for SDXL"
thumbnail: /blog/assets/simple_sdxl_optimizations/thumbnail.png
authors:
- user: sayakpaul
- user: stevhliu
---

# Exploring simple optimizations for SDXL

<a target="_blank" href="https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/exploring_simple%20optimizations_for_sdxl.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[Stable Diffusion XL (SDXL)](https://huggingface.co/papers/2307.01952) is the latest latent diffusion model by Stability AI for generating high-quality super realistic images. It overcomes challenges of previous Stable Diffusion models like getting hands and text right as well as spatially correct compositions. In addition, SDXL is also more context aware and requires fewer words in its prompt to generate better looking images.

However, all of these improvements come at the expense of a significantly larger model. How much larger? The base SDXL model has 3.5B parameters (the UNet, in particular), which is approximately 3x larger than the previous Stable Diffusion model.

To explore how we can optimize SDXL for inference speed and memory use, we ran some tests on an A100 GPU (40 GB). For each inference run, we generate 4 images and repeat it 3 times. While computing the inference latency, we only consider the final iteration out of the 3 iterations.

So if you run SDXL out-of-the-box as is with full precision and use the default attention mechanism, it’ll consume 28GB of memory and take 72.2 seconds!

```python
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to("cuda")
pipeline.unet.set_default_attn_processor()
```

This isn’t very practical and can slow you down because you’re often generating more than 4 images. And if you don’t have a more powerful GPU, you’ll run into that frustrating out-of-memory error message. So how can we optimize SDXL to increase inference speed and reduce its memory-usage?

In 🤗 Diffusers, we have a bunch of optimization tricks and techniques to help you run memory-intensive models like SDXL and we'll show you how! The two things we’ll focus on are *inference speed* and *memory*.

<div style="background-color: #e6f9e6; padding: 16px 32px; outline: 2px solid; border-radius: 5px;">
🧠 The techniques discussed in this post are applicable to all the <a href=https://huggingface.co/docs/diffusers/main/en/using-diffusers/pipeline_overview>pipelines</a>.
</div>

## Inference speed

Diffusion is a random process, so there's no guarantee you'll get an image you’ll like. Often times, you’ll need to run inference multiple times and iterate, and that’s why optimizing for speed is crucial. This section focuses on using lower precision weights and incorporating memory-efficient attention and `torch.compile` from PyTorch 2.0 to boost speed and reduce inference time.

### Lower precision

Model weights are stored at a certain *precision* which is expressed as a floating point data type. The standard floating point data type is float32 (fp32), which can accurately represent a wide range of floating numbers. For inference, you often don’t need to be as precise so you should use float16 (fp16) which captures a narrower range of floating numbers. This means fp16 only takes half the amount of memory to store compared to fp32, and is twice as fast because it is easier to calculate. In addition, modern GPU cards have optimized hardware to run fp16 calculations, making it even faster.

With 🤗 Diffusers, you can use fp16 for inference by specifying the `torch.dtype` parameter to convert the weights when the model is loaded:

```python
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
).to("cuda")
pipeline.unet.set_default_attn_processor()
```

Compared to a completely unoptimized SDXL pipeline, using fp16 takes 21.7GB of memory and only 14.8 seconds. You’re almost speeding up inference by a full minute!

### Memory-efficient attention

The attention blocks used in transformers modules can be a huge bottleneck, because memory increases _quadratically_ as input sequences get longer. This can quickly take up a ton of memory and leave you with an out-of-memory error message. 😬

Memory-efficient attention algorithms seek to reduce the memory burden of calculating attention, whether it is by exploiting sparsity or tiling. These optimized algorithms used to be mostly available as third-party libraries that needed to be installed separately. But starting with PyTorch 2.0, this is no longer the case. PyTorch 2 introduced [scaled dot product attention (SDPA)](https://pytorch.org/blog/accelerated-diffusers-pt-20/), which offers fused implementations of [Flash Attention](https://huggingface.co/papers/2205.14135), [memory-efficient attention](https://huggingface.co/papers/2112.05682) (xFormers), and a PyTorch implementation in C++. SDPA is probably the easiest way to speed up inference: if you’re using PyTorch ≥ 2.0 with 🤗 Diffusers, it is automatically enabled by default!

```python
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
```

Compared to a completely unoptimized SDXL pipeline, using fp16 and SDPA takes the same amount of memory and the inference time improves to 11.4 seconds. Let’s use this as the new baseline we’ll compare the other optimizations to.

### torch.compile

PyTorch 2.0 also introduced the `torch.compile` API for just-in-time (JIT) compilation of your PyTorch code into more optimized kernels for inference. Unlike other compiler solutions, `torch.compile` requires minimal changes to your existing code and it is as easy as wrapping your model with the function.

With the `mode` parameter, you can optimize for memory overhead or inference speed during compilation, which gives you way more flexibility.

```python
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
```

Compared to the previous baseline (fp16 + SDPA), wrapping the UNet with `torch.compile` improves inference time to 10.2 seconds.

<div style="background-color: #e6f9e6; padding: 16px 32px; outline: 2px solid; border-radius: 5px;">
⚠️ The first time you compile a model is slower, but once the model is compiled, all subsequent calls to it are much faster!
</div>

## Model memory footprint

Models today are growing larger and larger, making it a challenge to fit them into memory. This section focuses on how you can reduce the memory footprint of these enormous models so you can run them on consumer GPUs. These techniques include CPU offloading, decoding latents into images over several steps rather than all at once, and using a distilled version of the autoencoder.

### Model CPU offloading

Model offloading saves memory by loading the UNet into the GPU memory while the other components of the diffusion model (text encoders, VAE) are loaded onto the CPU. This way, the UNet can run for multiple iterations on the GPU until it is no longer needed.

```python
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()
```

Compared to the baseline, it now takes 20.2GB of memory which saves you 1.5GB of memory.

### Sequential CPU offloading

Another type of offloading which can save you more memory at the expense of slower inference is sequential CPU offloading. Rather than offloading an entire model - like the UNet - model weights stored in different UNet submodules are offloaded to the CPU and only loaded onto the GPU right before the forward pass. Essentially, you’re only loading parts of the model each time which allows you to save even more memory. The only downside is that it is significantly slower because you’re loading and offloading submodules many times.

```python
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipeline.enable_sequential_cpu_offload()
```

Compared to the baseline, this takes 19.9GB of memory but the inference time increases to 67 seconds.

## Slicing

In SDXL, a variational encoder (VAE) decodes the refined latents (predicted by the UNet) into realistic images. The memory requirement of this step scales with the number of images being predicted (the batch size). Depending on the image resolution and the available GPU VRAM, it can be quite memory-intensive.

This is where “slicing” is useful. The input tensor to be decoded is split into slices and the computation to decode it is completed over several steps. This saves memory and allows larger batch sizes.

```python
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
pipe.enable_vae_slicing()
```

With sliced computations, we reduce the memory to 15.4GB. If we add sequential CPU offloading, it is further reduced to 11.45GB which lets you generate 4 images (1024x1024) per prompt. However, with sequential offloading, the inference latency also increases.

## Caching computations

Any text-conditioned image generation model typically uses a text encoder to compute embeddings from the input prompt. SDXL uses *two* text encoders! This contributes quite a bit to the inference latency. However, since these embeddings remain unchanged throughout the reverse diffusion process, we can precompute them and reuse them as we go. This way, after computing the text embeddings, we can remove the text encoders from memory.

First, load the text encoders and their corresponding tokenizers and compute the embeddings from the input prompt:

```python
tokenizers = [tokenizer, tokenizer_2]
text_encoders = [text_encoder, text_encoder_2]

(
prompt_embeds,
negative_prompt_embeds,
pooled_prompt_embeds,
negative_pooled_prompt_embeds
) = encode_prompt(tokenizers, text_encoders, prompt)
```

Next, flush the GPU memory to remove the text encoders:

```jsx
del text_encoder, text_encoder_2, tokenizer, tokenizer_2
flush()
```

Now the embeddings are good to go straight to the SDXL pipeline:

```python
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
text_encoder=None,
text_encoder_2=None,
tokenizer=None,
tokenizer_2=None,
torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

call_args = dict(
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
pooled_prompt_embeds=pooled_prompt_embeds,
negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
num_images_per_prompt=num_images_per_prompt,
num_inference_steps=num_inference_steps,
)
image = pipe(**call_args).images[0]
```

Combined with SDPA and fp16, we can reduce the memory to 21.9GB. Other techniques discussed above for optimizing memory can also be used with cached computations.

## Tiny Autoencoder

As previously mentioned, a VAE decodes latents into images. Naturally, this step is directly bottlenecked by the size of the VAE. So, let’s just use a smaller autoencoder! The [Tiny Autoencoder by `madebyollin`](https://github.com/madebyollin/taesd), available [the Hub](https://huggingface.co/madebyollin/taesdxl) is just 10MB and it is distilled from the original VAE used by SDXL.

```python
from diffusers import AutoencoderTiny

pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
```

With this setup, we reduce the memory requirement to 15.6GB while reducing the inference latency at the same time.

<div style="background-color: #e6f9e6; padding: 16px 32px; outline: 2px solid; border-radius: 5px;">
⚠️ The Tiny Autoencoder can omit some of the more fine-grained details from images, which is why the Tiny AutoEncoder is more appropriate for image previews.
</div>

## Conclusion

To conclude and summarize the savings from our optimizations:

<div style="background-color: #e6f9e6; padding: 16px 32px; outline: 2px solid; border-radius: 5px;">
⚠️ While profiling GPUs to measure the trade-off between inference latency and memory requirements, it is important to be aware of the hardware being used. The above findings may not translate equally from hardware to hardware. For example, `torch.compile` only seems to benefit modern GPUs, at least for SDXL.
</div>

| **Technique** | **Memory (GB)** | **Inference latency (ms)** |
| --- | --- | --- |
| unoptimized pipeline | 28.09 | 72200.5 |
| fp16 | 21.72 | 14800.9 |
| **fp16 + SDPA (default)** | **21.72** | **11413.0** |
| default + `torch.compile` | 21.73 | 10296.7 |
| default + model CPU offload | 20.21 | 16082.2 |
| default + sequential CPU offload | 19.91 | 67034.0 |
| **default + VAE slicing** | **15.40** | **11232.2** |
| default + VAE slicing + sequential CPU offload | 11.47 | 66869.2 |
| default + precomputed text embeddings | 21.85 | 11909.0 |
| default + Tiny Autoencoder | 15.48 | 10449.7 |

We hope these optimizations make it a breeze to run your favorite pipelines. Try these techniques out and share your images with us! 🤗

0 comments on commit 463c064

Please sign in to comment.