
Support StableVideoPipeline (#80)
* demonstrate the ability to quantize models

* bump version to 0.0.15.post1 & refactor cutlass code & build for torch 2.1.2

* bump version to 1.0.0 and support StableVideoDiffusionPipeline

* fix README.md
chengzeyi authored Dec 19, 2023
1 parent b6cbf9b commit 63ab2a0
Showing 21 changed files with 804 additions and 475 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/wheels.yml
@@ -31,8 +31,8 @@ jobs:
- "3.10"
- "3.11"
torch_version:
- "2.1.0"
- "2.1.1"
- "2.1.2"
cuda_short_version:
- "118"
- "121"
41 changes: 25 additions & 16 deletions README.md
@@ -6,17 +6,20 @@

[Discord Channel](https://discord.gg/kQFvfzM4SJ)

`stable-fast` achieves SOTA inference performance on __ALL__ kinds of diffuser models.
`stable-fast` achieves SOTA inference performance on __ALL__ kinds of diffuser models, even with the latest `StableVideoDiffusionPipeline`.
And unlike `TensorRT` or `AITemplate`, which take dozens of minutes to compile a model, `stable-fast` compiles a model in only a few seconds.
`stable-fast` also supports `dynamic shape`, `LoRA` and `ControlNet` out of the box.

[![](https://mermaid.ink/img/pako:eNpFUk2PmzAQ_SsjS1EuQGxYEuBQqdL20EMvuzlUXe9hgAEsgY2w2U2E-O81odoePJp5Hs1787GwytTECnY4LEorV8BydB0NdCzg2JiJrDuusB4OUt_uVYeTC0tyKLVTrieQ7KoGAp8ILWma0CndgtEEasCW4FO5Dl6fQURpAKmIb_55h4N1NNrAZ8LL9Tc88YxL5ilCvCkLb5J9oFZ9j-DMVHWSBZ7p4UaVGUbV0w59_3mlYezR_Ys9cdObzz24krZmerlCHvFI7Jh1WPYUNmidZO9S33dCyQbPpixVRtdWMuAQht8g5pxLLXWJE7yJLL8EIFLBvRVp5i3nSQB5LjaTvrOADTQNqGo_zkVq2DRvo5Ss8G6JlrYeV5-HszOvd12xwk0zBWwea9_Ds8J2woEVDfb2C_1RK9_5F9gbrMmHC3P3cVtcq6zzJb3yRrUbPk-9hzvnRlucTtt31Po1zOU2u5NV9bbF7iM_n87xOcM4ofMlwTRJ6qoUedbET6KpL1zEyNY1YCPqP8b8V0UPPb_2q3kcz_oX1ja5bQ?type=png)](https://mermaid.live/edit#pako:eNpFUk2PmzAQ_SsjS1EuQGxYEuBQqdL20EMvuzlUXe9hgAEsgY2w2U2E-O81odoePJp5Hs1787GwytTECnY4LEorV8BydB0NdCzg2JiJrDuusB4OUt_uVYeTC0tyKLVTrieQ7KoGAp8ILWma0CndgtEEasCW4FO5Dl6fQURpAKmIb_55h4N1NNrAZ8LL9Tc88YxL5ilCvCkLb5J9oFZ9j-DMVHWSBZ7p4UaVGUbV0w59_3mlYezR_Ys9cdObzz24krZmerlCHvFI7Jh1WPYUNmidZO9S33dCyQbPpixVRtdWMuAQht8g5pxLLXWJE7yJLL8EIFLBvRVp5i3nSQB5LjaTvrOADTQNqGo_zkVq2DRvo5Ss8G6JlrYeV5-HszOvd12xwk0zBWwea9_Ds8J2woEVDfb2C_1RK9_5F9gbrMmHC3P3cVtcq6zzJb3yRrUbPk-9hzvnRlucTtt31Po1zOU2u5NV9bbF7iM_n87xOcM4ofMlwTRJ6qoUedbET6KpL1zEyNY1YCPqP8b8V0UPPb_2q3kcz_oX1ja5bQ)
[![](https://mermaid.ink/img/pako:eNpVUsFu2zAM_RVCQJCLnchOncQ-DBjWHXbYpc1hWNUDbdO2AFsyLKVNFvjfR8VFhwkwQT5LfOQjb6KyNYlCxHGsTGVNo9tCGeBzuX7rcPIfUTjvuvZdAUcp_2Ed6bbzBaQZg_ckq9VNG83Qbe07GmhdwLqxEzm_nmFerZS5XKuQOS7JI3R20n-s8dgr47XvCZQ46YGA30BLhib02rRgDYEesCUuw3fw_AjJJosgS9ILf-xIcJ5GF_FNeDr9ggd5lEowW4wX7eBFiTc0uu8RvJ2qTomIme7uprLDqHtaoK8_TjSMPfqPmImb3r4vwYmMs9PTCfKN3CQL5jyWPcUNOq_EqzLXhVCJgdm0I1a1dkqAhDj-AqkM8ilT4gQvyTE_RJBkiWSbZEe2Uu4iyPMkmOxVRGKgaUBd84xuQXauOaiqRMFuiY5CjzPfw7O3z1dTicJPZ4rEeay5h0eN7YTD_-D3WnPjomiwdwz2Fmvi8Cb8dQzL0GrnOeOyDgE_Tz3DnfejK7bb8HvT8hTOZZBu63Qd5tm95fvtPt0fMd3R_rDDbLerqzLJj036kDT1QSYpinmOxIjmt7XDZwF0r-fnson3hZz_AskE0h8?type=png)](https://mermaid.live/edit#pako:eNpVUsFu2zAM_RVCQJCLnchOncQ-DBjWHXbYpc1hWNUDbdO2AFsyLKVNFvjfR8VFhwkwQT5LfOQjb6KyNYlCxHGsTGVNo9tCGeBzuX7rcPIfUTjvuvZdAUcp_2Ed6bbzBaQZg_ckq9VNG83Qbe07GmhdwLqxEzm_nmFerZS5XKuQOS7JI3R20n-s8dgr47XvCZQ46YGA30BLhib02rRgDYEesCUuw3fw_AjJJosgS9ILf-xIcJ5GF_FNeDr9ggd5lEowW4wX7eBFiTc0uu8RvJ2qTomIme7uprLDqHtaoK8_TjSMPfqPmImb3r4vwYmMs9PTCfKN3CQL5jyWPcUNOq_EqzLXhVCJgdm0I1a1dkqAhDj-AqkM8ilT4gQvyTE_RJBkiWSbZEe2Uu4iyPMkmOxVRGKgaUBd84xuQXauOaiqRMFuiY5CjzPfw7O3z1dTicJPZ4rEeay5h0eN7YTD_-D3WnPjomiwdwz2Fmvi8Cb8dQzL0GrnOeOyDgE_Tz3DnfejK7bb8HvT8hTOZZBu63Qd5tm95fvtPt0fMd3R_rDDbLerqzLJj036kDT1QSYpinmOxIjmt7XDZwF0r-fnson3hZz_AskE0h8)

| Framework | torch | torch.compile | AIT | oneflow | TensorRT | __stable-fast__ |
| --------- | ----- | ------------- | ---- | ------- | -------- | --------------- |
| Time/ms | 1897 | 1510 | 1158 | 1003 | 991 | __995__ |
[![](https://mermaid.ink/img/pako:eNpFUk1v2zAM_SuEgCAXu7GdL8eHXdYdd1mDoVjVA23RtgBbMiwmSxb4v49OipYn6VHie3zkTVXekCpUHMfaVd7Vtim0A4nL9XuLI3_c5vhrDbcF5EnyhbVkm5YLSLcC3ossFjfrrEC3JbfU07KAZe1HCrycYFostLtcq7lyXBIjtH60_7xj7LRjyx2BVkfbE8gfaMjRiGxdA94RnK0hLzK4hZffz_HrMYI0yTaXbZpFkG2hHrGnEMlb-HV8hU2SJ1oJX4wXG-BNqzM623UI7Meq1SoSrvvxqfL9YDt6QIGx7CiuMbBW79pdH_8lQeKQCVpBAnH8Tcilae1KHOHtkEewTyLY5O8qUj2NPVojxt5mr4RmtkKrQo4lBpplTfIOT-xfrq5SBY8nitRpMMj0bLGRVlRRYxc-0R_GithPsPNoSK43xddhHmFjA0vJxxBn_DR2ArfMQyhWqzn91Ih3p3JudxWsmafQng-71S7b5Zitabdf43a9NlWZHvI626S12SdphmqaIjWg--P9lyq66_n52J_7Gk3_AVazvkQ?type=png)](https://mermaid.live/edit#pako:eNpFUk1v2zAM_SuEgCAXu7GdL8eHXdYdd1mDoVjVA23RtgBbMiwmSxb4v49OipYn6VHie3zkTVXekCpUHMfaVd7Vtim0A4nL9XuLI3_c5vhrDbcF5EnyhbVkm5YLSLcC3ossFjfrrEC3JbfU07KAZe1HCrycYFostLtcq7lyXBIjtH60_7xj7LRjyx2BVkfbE8gfaMjRiGxdA94RnK0hLzK4hZffz_HrMYI0yTaXbZpFkG2hHrGnEMlb-HV8hU2SJ1oJX4wXG-BNqzM623UI7Meq1SoSrvvxqfL9YDt6QIGx7CiuMbBW79pdH_8lQeKQCVpBAnH8Tcilae1KHOHtkEewTyLY5O8qUj2NPVojxt5mr4RmtkKrQo4lBpplTfIOT-xfrq5SBY8nitRpMMj0bLGRVlRRYxc-0R_GithPsPNoSK43xddhHmFjA0vJxxBn_DR2ArfMQyhWqzn91Ih3p3JudxWsmafQng-71S7b5Zitabdf43a9NlWZHvI626S12SdphmqaIjWg--P9lyq66_n52J_7Gk3_AVazvkQ)

__NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and `CUDA Graph enabled` while `stable-fast` is running with full dynamic shape.
| Model | torch | torch.compile | AIT | oneflow | TensorRT | __stable-fast__ |
| ----------- | ----- | ------------- | ---- | ------- | -------- | --------------- |
| SD 1.5 (ms) | 1897 | 1510 | 1158 | 1003 | 991 | __995__ |
| SVD-XT (s) | 98 | 70 | | | | __48__ |

__NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and `CUDA Graph enabled` while `stable-fast` is running with dynamic shape.

- [🚀Stable Fast](#stable-fast)
- [Introduction](#introduction)
@@ -28,6 +31,7 @@ __NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and
- [Usage](#usage)
- [Optimize StableDiffusionPipeline](#optimize-stablediffusionpipeline)
- [Optimize LCM Pipeline](#optimize-lcm-pipeline)
- [Optimize StableVideoDiffusionPipeline](#optimize-stablevideodiffusionpipeline)
- [Dynamically Switch LoRA](#dynamically-switch-lora)
- [Model Quantization](#model-quantization)
- [Some Common Methods To Speed Up PyTorch](#some-common-methods-to-speed-up-pytorch)
@@ -40,8 +44,6 @@ __NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and

## Introduction

__NOTE__: `stable-fast` is currently only in beta stage and is prone to be buggy, feel free to try it out and give suggestions!

### What is this?

`stable-fast` is an ultra lightweight inference optimization framework for __HuggingFace Diffusers__ on __NVIDIA GPUs__.
@@ -51,7 +53,7 @@ __NOTE__: `stable-fast` is currently only in beta stage and is prone to be buggy
- __Low Precision & Fused GEMM__: `stable-fast` implements a series of fused GEMM operators that compute with `fp16` precision, which is faster than PyTorch's defaults (read & write in `fp16` while computing in `fp32`).
- __Fused Linear GEGLU__: `stable-fast` is able to fuse `GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)` into one CUDA kernel (see the sketch after this list).
- __NHWC & Fused GroupNorm__: `stable-fast` implements a highly optimized fused NHWC `GroupNorm + Silu` operator with OpenAI's `Triton`, which eliminates the need for memory-format permutation operators.
- __Fully Traced Model__: `stable-fast` improves the `torch.jit.trace` interface to make it better suited to tracing complex models. Nearly every part of `StableDiffusionPipeline` can be traced and converted to __TorchScript__. It is more stable than `torch.compile`, has significantly lower CPU overhead, and supports __ControlNet__ and __LoRA__.
- __Fully Traced Model__: `stable-fast` improves the `torch.jit.trace` interface to make it better suited to tracing complex models. Nearly every part of `StableDiffusionPipeline/StableVideoDiffusionPipeline` can be traced and converted to __TorchScript__. It is more stable than `torch.compile`, has significantly lower CPU overhead, and supports __ControlNet__ and __LoRA__.
- __CUDA Graph__: `stable-fast` can capture the `UNet`, `VAE` and `TextEncoder` into CUDA Graph format, which can reduce the CPU overhead when the batch size is small. This implementation also supports dynamic shape.
- __Fused Multihead Attention__: `stable-fast` just uses xformers and makes it compatible with __TorchScript__.
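
For reference, the GEGLU formula above corresponds to this minimal unfused PyTorch sketch; the fused kernel computes the same result in a single pass (the function and tensor names here are illustrative, not part of the `sfast` API):

```python
import torch
import torch.nn.functional as F

def geglu(x: torch.Tensor, W: torch.Tensor, V: torch.Tensor,
          b: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    # GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)
    return F.gelu(x @ W + b) * (x @ V + c)
```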

@@ -64,14 +66,14 @@ But I still need to do some work to make it more stable and easy to use and prov

- __Fast__: `stable-fast` is specially optimized for __HuggingFace Diffusers__. It achieves high performance across many libraries, and it compiles in only a few seconds, significantly faster than `torch.compile`, `TensorRT` and `AITemplate`.
- __Minimal__: `stable-fast` works as a plugin framework for `PyTorch`. It utilizes existing `PyTorch` functionality and infrastructures and is compatible with other acceleration techniques, as well as popular fine-tuning techniques and deployment solutions.
- __Maximum Compatibility__: `stable-fast` is compatible with all kinds of `HuggingFace Diffusers` and `PyTorch` versions. It is also compatible with `ControlNet` and `LoRA`.
- __Maximum Compatibility__: `stable-fast` is compatible with all kinds of `HuggingFace Diffusers` and `PyTorch` versions. It is also compatible with `ControlNet` and `LoRA`. And it even supports the latest `StableVideoDiffusionPipeline` out of the box!

## Installation

__NOTE__: `stable-fast` is currently only tested on `Linux` and `WSL2 in Windows`.
You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested).

I only test `stable-fast` with `torch==2.1.0`, `xformers==0.0.22` and `triton==2.1.0` on `CUDA 12.1` and `Python 3.10`.
I only test `stable-fast` with `torch>=2.1.0`, `xformers>=0.0.22` and `triton>=2.1.0` on `CUDA 12.1` and `Python 3.10`.
Other versions might build and run successfully but that's not guaranteed.

### Install Prebuilt Wheels
@@ -96,7 +98,7 @@ pip3 install --index-url https://download.pytorch.org/whl/cu121 'diffusers>=0.19
# Install PyTorch with CUDA and other packages first.
# Windows users: Triton might not be available; you can skip it.
# NOTE: 'wheel' is required or you will get a `No module named 'torch'` error when building.
pip3 install wheel 'torch>=1.12.0' 'diffusers>=0.19.3' 'xformers>=0.0.20' 'triton>=2.1.0'
pip3 install wheel 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'

# (Optional) Makes the build much faster.
pip3 install ninja
@@ -124,9 +126,8 @@ import time
import torch
from diffusers import (StableDiffusionPipeline,
EulerAncestralDiscreteScheduler)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
compile, CompilationConfig)

from sfast.compilers.diffusion_pipeline_compiler import (compile,
CompilationConfig)

def load_model():
model = StableDiffusionPipeline.from_pretrained(
@@ -139,7 +140,6 @@ def load_model():
model.to(torch.device('cuda'))
return model


model = load_model()

config = CompilationConfig.Default()
@@ -155,6 +155,8 @@ try:
except ImportError:
print('Triton not installed, skip')
# CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.
# But it can increase the amount of GPU memory used.
# For StableVideoDiffusionPipeline it is not needed.
config.enable_cuda_graph = True

model = compile(model, config)
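
Pieced together, the hunks above amount to roughly the following end-to-end flow (a minimal sketch; the model id, prompt, and step count are illustrative assumptions, not taken from the README):

```python
import torch
from diffusers import (StableDiffusionPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
                                                         CompilationConfig)

model = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16)
model.scheduler = EulerAncestralDiscreteScheduler.from_config(
    model.scheduler.config)
model.to(torch.device('cuda'))

config = CompilationConfig.Default()
config.enable_xformers = True    # requires xformers
config.enable_triton = True      # requires triton
config.enable_cuda_graph = True  # helps small batches; uses extra GPU memory
model = compile(model, config)

# The first few calls trigger tracing/compilation and are slow; warm up first.
for _ in range(3):
    model(prompt='a photo of a cat', num_inference_steps=30)

image = model(prompt='a photo of a cat', num_inference_steps=30).images[0]
image.save('cat.png')
```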
@@ -197,6 +199,12 @@ You can check this Colab to see how it works on T4 GPU: [![Open In Colab](https:
Refer to [examples/optimize_lcm_lora.py](examples/optimize_lcm_lora.py) for more details about how to optimize a normal SD model with LCM LoRA.
Refer to [examples/optimize_lcm_pipeline.py](examples/optimize_lcm_pipeline.py) for more details about how to optimize the standalone LCM model.

### Optimize StableVideoDiffusionPipeline

`stable-fast` is able to optimize the newest `StableVideoDiffusionPipeline` and achieve a `2x` speedup.

Refer to [examples/optimize_stable_video_diffusion_pipeline.py](examples/optimize_stable_video_diffusion_pipeline.py) for more details.
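
A minimal compile-and-run sketch might look like the following (the model id, fp16 variant, input resolution, and the `load_image`/`export_to_video` helpers are assumptions based on the `diffusers` API, not taken from the example script):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
from sfast.compilers.diffusion_pipeline_compiler import (compile,
                                                         CompilationConfig)

pipe = StableVideoDiffusionPipeline.from_pretrained(
    'stabilityai/stable-video-diffusion-img2vid-xt',
    torch_dtype=torch.float16, variant='fp16')
pipe.to(torch.device('cuda'))

config = CompilationConfig.Default()
config.enable_xformers = True
config.enable_triton = True
# Per the note in the usage section, CUDA Graph is not needed for SVD.
config.enable_cuda_graph = False
pipe = compile(pipe, config)

image = load_image('input.jpg')  # e.g. a 1024x576 conditioning image
# The first call triggers tracing/compilation and is slow; later calls are fast.
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, 'output.mp4', fps=7)
```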

### Dynamically Switch LoRA

Switching LoRA dynamically is supported, but you need to do some extra work; the sketch below illustrates the general idea.
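
The key point is that the traced graph holds references to the original parameter tensors, so copying new LoRA weights in place takes effect without recompiling. A hedged sketch of that pattern (`update_state_dict` and `fresh_unet` are illustrative names, not the project's API; see the collapsed README section for the maintained version):

```python
import torch

def update_state_dict(dst, src):
    with torch.no_grad():
        for key, value in src.items():
            # In-place copy: the traced forward function shares these tensors,
            # so the new weights take effect without recompiling.
            dst[key].copy_(value)

# Illustrative usage: `model` is the compiled pipeline and `fresh_unet` is an
# untraced UNet of the same architecture with a different LoRA already merged.
update_state_dict(model.unet.state_dict(), fresh_unet.state_dict())
```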
@@ -333,6 +341,7 @@ Thanks for __@SuperSecureHuman__ and __@jon-chuang__'s help, benchmarking on A10
| With LoRA | Yes |
| Latent Consistency Model | Yes |
| SDXL Turbo | Yes |
| Stable Video Diffusion | Yes |

| Functionality | Supported |
| ----------------------------------- | --------- |
11 changes: 7 additions & 4 deletions examples/optimize_lcm_lora.py
@@ -20,8 +20,8 @@
import json
import torch
from PIL import (Image, ImageDraw)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
compile, CompilationConfig)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
CompilationConfig)


def parse_args():
@@ -234,8 +234,11 @@ def get_kwarg_inputs():
# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(args.warmups):
model(**get_kwarg_inputs())
if args.warmups > 0:
print('Begin warmup')
for _ in range(args.warmups):
model(**get_kwarg_inputs())
print('End warmup')

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
11 changes: 7 additions & 4 deletions examples/optimize_lcm_pipeline.py
@@ -20,8 +20,8 @@
import json
import torch
from PIL import (Image, ImageDraw)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
compile, CompilationConfig)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
CompilationConfig)


def parse_args():
@@ -234,8 +234,11 @@ def get_kwarg_inputs():
# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(args.warmups):
model(**get_kwarg_inputs())
if args.warmups > 0:
print('Begin warmup')
for _ in range(args.warmups):
model(**get_kwarg_inputs())
print('End warmup')

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
11 changes: 7 additions & 4 deletions examples/optimize_stable_diffusion_pipeline.py
@@ -20,8 +20,8 @@
import json
import torch
from PIL import (Image, ImageDraw)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
compile, CompilationConfig)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
CompilationConfig)


def parse_args():
@@ -234,8 +234,11 @@ def get_kwarg_inputs():
# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(args.warmups):
model(**get_kwarg_inputs())
if args.warmups > 0:
print('Begin warmup')
for _ in range(args.warmups):
model(**get_kwarg_inputs())
print('End warmup')

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.