
Support StableVideoPipeline (#80)
* demonstrate the ability to quantize models

* bump version to 0.0.15.post1 & refactor cutlass code & build for torch 2.1.2

* bump version to 1.0.0 and support StableVideoDiffusionPipeline

* fix README.md
chengzeyi authored Dec 19, 2023
1 parent b6cbf9b commit 63ab2a0
Showing 21 changed files with 804 additions and 475 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/wheels.yml
@@ -31,8 +31,8 @@ jobs:
- "3.10"
- "3.11"
torch_version:
- "2.1.0"
- "2.1.1"
- "2.1.2"
cuda_short_version:
- "118"
- "121"
41 changes: 25 additions & 16 deletions README.md
@@ -6,17 +6,20 @@

[Discord Channel](https://discord.gg/kQFvfzM4SJ)

`stable-fast` achieves SOTA inference performance on __ALL__ kinds of diffuser models.
`stable-fast` achieves SOTA inference performance on __ALL__ kinds of diffuser models, even with the latest `StableVideoDiffusionPipeline`.
And unlike `TensorRT` or `AITemplate`, which take dozens of minutes to compile a model, `stable-fast` compiles a model in only a few seconds.
`stable-fast` also supports `dynamic shape`, `LoRA` and `ControlNet` out of the box.

[![](https://mermaid.ink/img/pako:eNpFUk2PmzAQ_SsjS1EuQGxYEuBQqdL20EMvuzlUXe9hgAEsgY2w2U2E-O81odoePJp5Hs1787GwytTECnY4LEorV8BydB0NdCzg2JiJrDuusB4OUt_uVYeTC0tyKLVTrieQ7KoGAp8ILWma0CndgtEEasCW4FO5Dl6fQURpAKmIb_55h4N1NNrAZ8LL9Tc88YxL5ilCvCkLb5J9oFZ9j-DMVHWSBZ7p4UaVGUbV0w59_3mlYezR_Ys9cdObzz24krZmerlCHvFI7Jh1WPYUNmidZO9S33dCyQbPpixVRtdWMuAQht8g5pxLLXWJE7yJLL8EIFLBvRVp5i3nSQB5LjaTvrOADTQNqGo_zkVq2DRvo5Ss8G6JlrYeV5-HszOvd12xwk0zBWwea9_Ds8J2woEVDfb2C_1RK9_5F9gbrMmHC3P3cVtcq6zzJb3yRrUbPk-9hzvnRlucTtt31Po1zOU2u5NV9bbF7iM_n87xOcM4ofMlwTRJ6qoUedbET6KpL1zEyNY1YCPqP8b8V0UPPb_2q3kcz_oX1ja5bQ?type=png)](https://mermaid.live/edit#pako:eNpFUk2PmzAQ_SsjS1EuQGxYEuBQqdL20EMvuzlUXe9hgAEsgY2w2U2E-O81odoePJp5Hs1787GwytTECnY4LEorV8BydB0NdCzg2JiJrDuusB4OUt_uVYeTC0tyKLVTrieQ7KoGAp8ILWma0CndgtEEasCW4FO5Dl6fQURpAKmIb_55h4N1NNrAZ8LL9Tc88YxL5ilCvCkLb5J9oFZ9j-DMVHWSBZ7p4UaVGUbV0w59_3mlYezR_Ys9cdObzz24krZmerlCHvFI7Jh1WPYUNmidZO9S33dCyQbPpixVRtdWMuAQht8g5pxLLXWJE7yJLL8EIFLBvRVp5i3nSQB5LjaTvrOADTQNqGo_zkVq2DRvo5Ss8G6JlrYeV5-HszOvd12xwk0zBWwea9_Ds8J2woEVDfb2C_1RK9_5F9gbrMmHC3P3cVtcq6zzJb3yRrUbPk-9hzvnRlucTtt31Po1zOU2u5NV9bbF7iM_n87xOcM4ofMlwTRJ6qoUedbET6KpL1zEyNY1YCPqP8b8V0UPPb_2q3kcz_oX1ja5bQ)
[![](https://mermaid.ink/img/pako:eNpVUsFu2zAM_RVCQJCLnchOncQ-DBjWHXbYpc1hWNUDbdO2AFsyLKVNFvjfR8VFhwkwQT5LfOQjb6KyNYlCxHGsTGVNo9tCGeBzuX7rcPIfUTjvuvZdAUcp_2Ed6bbzBaQZg_ckq9VNG83Qbe07GmhdwLqxEzm_nmFerZS5XKuQOS7JI3R20n-s8dgr47XvCZQ46YGA30BLhib02rRgDYEesCUuw3fw_AjJJosgS9ILf-xIcJ5GF_FNeDr9ggd5lEowW4wX7eBFiTc0uu8RvJ2qTomIme7uprLDqHtaoK8_TjSMPfqPmImb3r4vwYmMs9PTCfKN3CQL5jyWPcUNOq_EqzLXhVCJgdm0I1a1dkqAhDj-AqkM8ilT4gQvyTE_RJBkiWSbZEe2Uu4iyPMkmOxVRGKgaUBd84xuQXauOaiqRMFuiY5CjzPfw7O3z1dTicJPZ4rEeay5h0eN7YTD_-D3WnPjomiwdwz2Fmvi8Cb8dQzL0GrnOeOyDgE_Tz3DnfejK7bb8HvT8hTOZZBu63Qd5tm95fvtPt0fMd3R_rDDbLerqzLJj036kDT1QSYpinmOxIjmt7XDZwF0r-fnson3hZz_AskE0h8?type=png)](https://mermaid.live/edit#pako:eNpVUsFu2zAM_RVCQJCLnchOncQ-DBjWHXbYpc1hWNUDbdO2AFsyLKVNFvjfR8VFhwkwQT5LfOQjb6KyNYlCxHGsTGVNo9tCGeBzuX7rcPIfUTjvuvZdAUcp_2Ed6bbzBaQZg_ckq9VNG83Qbe07GmhdwLqxEzm_nmFerZS5XKuQOS7JI3R20n-s8dgr47XvCZQ46YGA30BLhib02rRgDYEesCUuw3fw_AjJJosgS9ILf-xIcJ5GF_FNeDr9ggd5lEowW4wX7eBFiTc0uu8RvJ2qTomIme7uprLDqHtaoK8_TjSMPfqPmImb3r4vwYmMs9PTCfKN3CQL5jyWPcUNOq_EqzLXhVCJgdm0I1a1dkqAhDj-AqkM8ilT4gQvyTE_RJBkiWSbZEe2Uu4iyPMkmOxVRGKgaUBd84xuQXauOaiqRMFuiY5CjzPfw7O3z1dTicJPZ4rEeay5h0eN7YTD_-D3WnPjomiwdwz2Fmvi8Cb8dQzL0GrnOeOyDgE_Tz3DnfejK7bb8HvT8hTOZZBu63Qd5tm95fvtPt0fMd3R_rDDbLerqzLJj036kDT1QSYpinmOxIjmt7XDZwF0r-fnson3hZz_AskE0h8)

| Framework | torch | torch.compile | AIT | oneflow | TensorRT | __stable-fast__ |
| --------- | ----- | ------------- | ---- | ------- | -------- | --------------- |
| Time/ms | 1897 | 1510 | 1158 | 1003 | 991 | __995__ |
[![](https://mermaid.ink/img/pako:eNpFUk1v2zAM_SuEgCAXu7GdL8eHXdYdd1mDoVjVA23RtgBbMiwmSxb4v49OipYn6VHie3zkTVXekCpUHMfaVd7Vtim0A4nL9XuLI3_c5vhrDbcF5EnyhbVkm5YLSLcC3ossFjfrrEC3JbfU07KAZe1HCrycYFostLtcq7lyXBIjtH60_7xj7LRjyx2BVkfbE8gfaMjRiGxdA94RnK0hLzK4hZffz_HrMYI0yTaXbZpFkG2hHrGnEMlb-HV8hU2SJ1oJX4wXG-BNqzM623UI7Meq1SoSrvvxqfL9YDt6QIGx7CiuMbBW79pdH_8lQeKQCVpBAnH8Tcilae1KHOHtkEewTyLY5O8qUj2NPVojxt5mr4RmtkKrQo4lBpplTfIOT-xfrq5SBY8nitRpMMj0bLGRVlRRYxc-0R_GithPsPNoSK43xddhHmFjA0vJxxBn_DR2ArfMQyhWqzn91Ih3p3JudxWsmafQng-71S7b5Zitabdf43a9NlWZHvI626S12SdphmqaIjWg--P9lyq66_n52J_7Gk3_AVazvkQ?type=png)](https://mermaid.live/edit#pako:eNpFUk1v2zAM_SuEgCAXu7GdL8eHXdYdd1mDoVjVA23RtgBbMiwmSxb4v49OipYn6VHie3zkTVXekCpUHMfaVd7Vtim0A4nL9XuLI3_c5vhrDbcF5EnyhbVkm5YLSLcC3ossFjfrrEC3JbfU07KAZe1HCrycYFostLtcq7lyXBIjtH60_7xj7LRjyx2BVkfbE8gfaMjRiGxdA94RnK0hLzK4hZffz_HrMYI0yTaXbZpFkG2hHrGnEMlb-HV8hU2SJ1oJX4wXG-BNqzM623UI7Meq1SoSrvvxqfL9YDt6QIGx7CiuMbBW79pdH_8lQeKQCVpBAnH8Tcilae1KHOHtkEewTyLY5O8qUj2NPVojxt5mr4RmtkKrQo4lBpplTfIOT-xfrq5SBY8nitRpMMj0bLGRVlRRYxc-0R_GithPsPNoSK43xddhHmFjA0vJxxBn_DR2ArfMQyhWqzn91Ih3p3JudxWsmafQng-71S7b5Zitabdf43a9NlWZHvI626S12SdphmqaIjWg--P9lyq66_n52J_7Gk3_AVazvkQ)

__NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and `CUDA Graph enabled` while `stable-fast` is running with full dynamic shape.
| Model | torch | torch.compile | AIT | oneflow | TensorRT | __stable-fast__ |
| ----------- | ----- | ------------- | ---- | ------- | -------- | --------------- |
| SD 1.5 (ms) | 1897 | 1510 | 1158 | 1003 | 991 | __995__ |
| SVD-XT (s) | 98 | 70 | | | | __48__ |

__NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and `CUDA Graph enabled` while `stable-fast` is running with dynamic shape.

- [🚀Stable Fast](#stable-fast)
- [Introduction](#introduction)
@@ -28,6 +31,7 @@ __NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and
- [Usage](#usage)
- [Optimize StableDiffusionPipeline](#optimize-stablediffusionpipeline)
- [Optimize LCM Pipeline](#optimize-lcm-pipeline)
- [Optimize StableVideoDiffusionPipeline](#optimize-stablevideodiffusionpipeline)
- [Dynamically Switch LoRA](#dynamically-switch-lora)
- [Model Quantization](#model-quantization)
- [Some Common Methods To Speed Up PyTorch](#some-common-methods-to-speed-up-pytorch)
@@ -40,8 +44,6 @@ __NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and

## Introduction

__NOTE__: `stable-fast` is currently only in beta stage and is prone to be buggy, feel free to try it out and give suggestions!

### What is this?

`stable-fast` is an ultra lightweight inference optimization framework for __HuggingFace Diffusers__ on __NVIDIA GPUs__.
@@ -51,7 +53,7 @@ __NOTE__: `stable-fast` is currently only in beta stage and is prone to be buggy
- __Low Precision & Fused GEMM__: `stable-fast` implements a series of fused GEMM operators that compute with `fp16` precision, which is faster than PyTorch's defaults (read & write in `fp16` while computing in `fp32`).
- __Fused Linear GEGLU__: `stable-fast` is able to fuse `GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)` into one CUDA kernel (see the sketch after this list).
- __NHWC & Fused GroupNorm__: `stable-fast` implements a highly optimized fused NHWC `GroupNorm + Silu` operator with OpenAI's `Triton`, which eliminates the need for memory-format permutation operators.
- __Fully Traced Model__: `stable-fast` improves the `torch.jit.trace` interface to make it better suited to tracing complex models. Nearly every part of `StableDiffusionPipeline` can be traced and converted to __TorchScript__. It is more stable than `torch.compile`, has significantly lower CPU overhead, and supports __ControlNet__ and __LoRA__.
- __Fully Traced Model__: `stable-fast` improves the `torch.jit.trace` interface to make it better suited to tracing complex models. Nearly every part of `StableDiffusionPipeline/StableVideoDiffusionPipeline` can be traced and converted to __TorchScript__. It is more stable than `torch.compile`, has significantly lower CPU overhead, and supports __ControlNet__ and __LoRA__.
- __CUDA Graph__: `stable-fast` can capture the `UNet`, `VAE` and `TextEncoder` into CUDA Graph format, which can reduce the CPU overhead when the batch size is small. This implementation also supports dynamic shape.
- __Fused Multihead Attention__: `stable-fast` just uses xformers and makes it compatible with __TorchScript__.
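
For reference, the GEGLU formula above corresponds to this minimal unfused PyTorch sketch; the fused kernel computes the same result in a single pass (the function and tensor names here are illustrative, not part of the `sfast` API):

```python
import torch
import torch.nn.functional as F

def geglu(x: torch.Tensor, W: torch.Tensor, V: torch.Tensor,
          b: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    # GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)
    return F.gelu(x @ W + b) * (x @ V + c)
```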

@@ -64,14 +66,14 @@ But I still need to do some work to make it more stable and easy to use and prov

- __Fast__: `stable-fast` is specially optimized for __HuggingFace Diffusers__. It achieves high performance across many libraries, and it compiles in only a few seconds, significantly faster than `torch.compile`, `TensorRT` and `AITemplate`.
- __Minimal__: `stable-fast` works as a plugin framework for `PyTorch`. It utilizes existing `PyTorch` functionality and infrastructures and is compatible with other acceleration techniques, as well as popular fine-tuning techniques and deployment solutions.
- __Maximum Compatibility__: `stable-fast` is compatible with all kinds of `HuggingFace Diffusers` and `PyTorch` versions. It is also compatible with `ControlNet` and `LoRA`.
- __Maximum Compatibility__: `stable-fast` is compatible with all kinds of `HuggingFace Diffusers` and `PyTorch` versions. It is also compatible with `ControlNet` and `LoRA`. And it even supports the latest `StableVideoDiffusionPipeline` out of the box!

## Installation

__NOTE__: `stable-fast` is currently only tested on `Linux` and `WSL2 in Windows`.
You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested).

I only test `stable-fast` with `torch==2.1.0`, `xformers==0.0.22` and `triton==2.1.0` on `CUDA 12.1` and `Python 3.10`.
I only test `stable-fast` with `torch>=2.1.0`, `xformers>=0.0.22` and `triton>=2.1.0` on `CUDA 12.1` and `Python 3.10`.
Other versions might build and run successfully but that's not guaranteed.

### Install Prebuilt Wheels
@@ -96,7 +98,7 @@ pip3 install --index-url https://download.pytorch.org/whl/cu121 'diffusers>=0.19
# Install PyTorch with CUDA and other packages first.
# Windows users: Triton might not be available; you can skip it.
# NOTE: 'wheel' is required or you will get a `No module named 'torch'` error when building.
pip3 install wheel 'torch>=1.12.0' 'diffusers>=0.19.3' 'xformers>=0.0.20' 'triton>=2.1.0'
pip3 install wheel 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'

# (Optional) Makes the build much faster.
pip3 install ninja
@@ -124,9 +126,8 @@ import time
import torch
from diffusers import (StableDiffusionPipeline,
EulerAncestralDiscreteScheduler)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
compile, CompilationConfig)

from sfast.compilers.diffusion_pipeline_compiler import (compile,
CompilationConfig)

def load_model():
model = StableDiffusionPipeline.from_pretrained(
@@ -139,7 +140,6 @@ def load_model():
model.to(torch.device('cuda'))
return model


model = load_model()

config = CompilationConfig.Default()
@@ -155,6 +155,8 @@ try:
except ImportError:
print('Triton not installed, skip')
# CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.
# But it can increase the amount of GPU memory used.
# For StableVideoDiffusionPipeline it is not needed.
config.enable_cuda_graph = True

model = compile(model, config)
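
Pieced together, the hunks above amount to roughly the following end-to-end flow (a minimal sketch; the model id, prompt, and step count are illustrative assumptions, not taken from the README):

```python
import torch
from diffusers import (StableDiffusionPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
                                                         CompilationConfig)

model = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16)
model.scheduler = EulerAncestralDiscreteScheduler.from_config(
    model.scheduler.config)
model.to(torch.device('cuda'))

config = CompilationConfig.Default()
config.enable_xformers = True    # requires xformers
config.enable_triton = True      # requires triton
config.enable_cuda_graph = True  # helps small batches; uses extra GPU memory
model = compile(model, config)

# The first few calls trigger tracing/compilation and are slow; warm up first.
for _ in range(3):
    model(prompt='a photo of a cat', num_inference_steps=30)

image = model(prompt='a photo of a cat', num_inference_steps=30).images[0]
image.save('cat.png')
```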
@@ -197,6 +199,12 @@ You can check this Colab to see how it works on T4 GPU: [![Open In Colab](https:
Refer to [examples/optimize_lcm_lora.py](examples/optimize_lcm_lora.py) for more details about how to optimize a normal SD model with LCM LoRA.
Refer to [examples/optimize_lcm_pipeline.py](examples/optimize_lcm_pipeline.py) for more details about how to optimize the standalone LCM model.

### Optimize StableVideoDiffusionPipeline

`stable-fast` is able to optimize the newest `StableVideoDiffusionPipeline` and achieve a `2x` speedup.

Refer to [examples/optimize_stable_video_diffusion_pipeline.py](examples/optimize_stable_video_diffusion_pipeline.py) for more details.
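
A minimal compile-and-run sketch might look like the following (the model id, fp16 variant, input resolution, and the `load_image`/`export_to_video` helpers are assumptions based on the `diffusers` API, not taken from the example script):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
from sfast.compilers.diffusion_pipeline_compiler import (compile,
                                                         CompilationConfig)

pipe = StableVideoDiffusionPipeline.from_pretrained(
    'stabilityai/stable-video-diffusion-img2vid-xt',
    torch_dtype=torch.float16, variant='fp16')
pipe.to(torch.device('cuda'))

config = CompilationConfig.Default()
config.enable_xformers = True
config.enable_triton = True
# Per the note in the usage section, CUDA Graph is not needed for SVD.
config.enable_cuda_graph = False
pipe = compile(pipe, config)

image = load_image('input.jpg')  # e.g. a 1024x576 conditioning image
# The first call triggers tracing/compilation and is slow; later calls are fast.
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, 'output.mp4', fps=7)
```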

### Dynamically Switch LoRA

Switching LoRA dynamically is supported, but you need to do some extra work; the sketch below illustrates the general idea.
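
The key point is that the traced graph holds references to the original parameter tensors, so copying new LoRA weights in place takes effect without recompiling. A hedged sketch of that pattern (`update_state_dict` and `fresh_unet` are illustrative names, not the project's API; see the collapsed README section for the maintained version):

```python
import torch

def update_state_dict(dst, src):
    with torch.no_grad():
        for key, value in src.items():
            # In-place copy: the traced forward function shares these tensors,
            # so the new weights take effect without recompiling.
            dst[key].copy_(value)

# Illustrative usage: `model` is the compiled pipeline and `fresh_unet` is an
# untraced UNet of the same architecture with a different LoRA already merged.
update_state_dict(model.unet.state_dict(), fresh_unet.state_dict())
```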
@@ -333,6 +341,7 @@ Thanks for __@SuperSecureHuman__ and __@jon-chuang__'s help, benchmarking on A10
| With LoRA | Yes |
| Latent Consistency Model | Yes |
| SDXL Turbo | Yes |
| Stable Video Diffusion | Yes |

| Functionality | Supported |
| ----------------------------------- | --------- |
11 changes: 7 additions & 4 deletions examples/optimize_lcm_lora.py
@@ -20,8 +20,8 @@
import json
import torch
from PIL import (Image, ImageDraw)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
compile, CompilationConfig)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
CompilationConfig)


def parse_args():
@@ -234,8 +234,11 @@ def get_kwarg_inputs():
# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(args.warmups):
model(**get_kwarg_inputs())
if args.warmups > 0:
print('Begin warmup')
for _ in range(args.warmups):
model(**get_kwarg_inputs())
print('End warmup')

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
11 changes: 7 additions & 4 deletions examples/optimize_lcm_pipeline.py
@@ -20,8 +20,8 @@
import json
import torch
from PIL import (Image, ImageDraw)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
compile, CompilationConfig)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
CompilationConfig)


def parse_args():
@@ -234,8 +234,11 @@ def get_kwarg_inputs():
# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(args.warmups):
model(**get_kwarg_inputs())
if args.warmups > 0:
print('Begin warmup')
for _ in range(args.warmups):
model(**get_kwarg_inputs())
print('End warmup')

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
11 changes: 7 additions & 4 deletions examples/optimize_stable_diffusion_pipeline.py
@@ -20,8 +20,8 @@
import json
import torch
from PIL import (Image, ImageDraw)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
compile, CompilationConfig)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
CompilationConfig)


def parse_args():
@@ -234,8 +234,11 @@ def get_kwarg_inputs():
# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(args.warmups):
model(**get_kwarg_inputs())
if args.warmups > 0:
print('Begin warmup')
for _ in range(args.warmups):
model(**get_kwarg_inputs())
print('End warmup')

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.