faster parallel inference of mochi-1 video generation model

xdit-project/mochi-xdit

mochi-xdit: Parallel Inference for Mochi-preview Video Generation Model with xDiT

📝 [Blog]Enhancing Parallelism and Speedup for xDiT in Serving the Mochi-1 Video Generation Model

This repository provides an accelerated way to deploy the Mochi 1 video generation model using the Unified Sequence Parallelism (USP) provided by xDiT.

Mochi-1 originally ran on 4xH100 (80GB) GPUs. We made it run on a single L40 (48GB) GPU with no accuracy loss!

Moreover, by applying xDiT, we reduced the latency of generating a 49-frame, 848x480 video from 398 seconds (6 minutes 38 seconds) to 74 seconds (1 minute 14 seconds) on 6xL40 GPUs. By improving parallelism and making better use of memory, it reduces inference latency by 3.54x compared to the official open-source implementation on 6xL40 GPUs!

| Metric | 1x L40 GPU | 2x L40 GPU (uly=2) | 2x L40 GPU (cfg=2) | 6x L40 GPU (cfg=2, ring=3) |
| --- | --- | --- | --- | --- |
| Performance | 394s | 222s (1.77x) | 198s (1.99x) | 74s (5.32x) |
| Memory | 30.83 GB | 35.05 GB | 36.69 GB | 30.94 GB |
| Preview | 1 GPU | 2 GPU Ulysses | 2 GPU CFG | 6 GPU |

The prompt of the video is: "Witness a grand space battle between starships, with lasers cutting through the darkness of space and explosions illuminating the void".
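The speedup figures in the table are the single-GPU latency divided by each multi-GPU latency. A short sketch that reproduces them from the latencies listed above (the dictionary keys and the `speedups` helper are illustrative names, not part of the repository):

```python
# Latencies in seconds, copied from the performance table above.
latency_s = {
    "1x L40": 394,
    "2x L40 (uly=2)": 222,
    "2x L40 (cfg=2)": 198,
    "6x L40 (cfg=2, ring=3)": 74,
}


def speedups(latencies, baseline_key):
    """Return baseline latency / latency for every entry, rounded to 2 decimals."""
    base = latencies[baseline_key]
    return {name: round(base / t, 2) for name, t in latencies.items()}


result = speedups(latency_s, "1x L40")
for name, factor in result.items():
    print(f"{name}: {factor:.2f}x")
```

On these numbers, the 6-GPU run reaches roughly 5.3x on 6 devices, so scaling is close to linear but not perfect.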

Highlights

  1. Memory optimization enables Mochi to generate video on a single 48GB L40 GPU with no accuracy loss.
  2. A tiled VAE decoder enables correct video generation at any resolution and reduces the memory footprint.
  3. Unified Sequence Parallelism (USP) for AsymmetricAttention using xDiT: hybrid 2D sequence parallelism with Ring-Attention and DeepSpeed-Ulysses.
  4. CFG parallelism from xDiT is applied to Mochi-1 in a simple way.
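To illustrate why tiling bounds memory and works for any resolution, here is a minimal sketch of the tile bookkeeping only. It is not the repository's decoder: `decode_tile` is a hypothetical pointwise stand-in, whereas a real VAE decoder upsamples and has a receptive field, so practical tiled decoders typically use overlapping tiles and blend the seams.

```python
def decode_tile(tile):
    """Hypothetical stand-in for a VAE decoder: a pointwise transform."""
    return [[2 * v for v in row] for row in tile]


def tiled_decode(latent, tile_h, tile_w):
    """Decode a 2D grid tile by tile and stitch the results back together.

    Only one tile is passed to decode_tile at a time, so peak working memory
    depends on the tile size rather than on the full resolution. Edge tiles
    are simply smaller, so the grid size need not be a multiple of the tile.
    """
    height, width = len(latent), len(latent[0])
    out = [[None] * width for _ in range(height)]
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            tile = [row[left:left + tile_w] for row in latent[top:top + tile_h]]
            decoded = decode_tile(tile)
            for dy, row in enumerate(decoded):
                out[top + dy][left:left + len(row)] = row
    return out


# A 5x7 grid with 2x3 tiles: neither dimension divides evenly.
latent = [[x + 10 * y for x in range(7)] for y in range(5)]
tiled = tiled_decode(latent, tile_h=2, tile_w=3)
full = decode_tile(latent)
print(tiled == full)
```

With a pointwise stand-in, the tiled result matches the untiled one exactly; with a real decoder, the overlap and blending strategy determines how close the two are.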

Usage

This repository provides an accelerated inference version of Mochi 1 using Unified Sequence Parallelism provided by xDiT.

| Feature | xDiT Version | Original Version |
| --- | --- | --- |
| Attention Parallel | Ulysses+Ring+CFG | Ulysses |
| VAE | Variable Size | Fixed Size |
| Model Loading | Replicated/FSDP | FSDP |

1. Install from source

pip install xfuser
sudo apt install ffmpeg
pip install .

2. Install from docker

docker pull thufeifeibear/mochi-dev:0.1

3. Run

Running mochi with a single GPU

CUDA_VISIBLE_DEVICES=0 python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>" --prompt "prompt"

Running mochi with multiple GPUs using Unified Sequence Parallelism provided by xDiT.

world_size is the total number of GPUs used for video generation. It is controlled by the number of GPUs listed in CUDA_VISIBLE_DEVICES.

Adjust ulysses_degree, ring_degree, and the CFG parallel degree to achieve optimal performance. If cfg_parallel is enabled, ulysses_degree x ring_degree x 2 = world_size. Otherwise, ulysses_degree x ring_degree = world_size.

E.g.,

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>" --prompt "prompt" \
 --use_xdit --ulysses_degree 3 --ring_degree 2

or

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>" --prompt "prompt" \
 --use_xdit --ulysses_degree 3 --ring_degree 1 --cfg_parallel
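
The relationship between the degrees and world_size can be checked before launching. The following is a minimal sketch, not part of mochi-xdit: the helper names (`world_size_from_env`, `check_degrees`, `valid_configs`) are hypothetical, and it assumes that CFG parallelism contributes a factor of 2, as the two example commands above imply.

```python
def world_size_from_env(cuda_visible_devices: str) -> int:
    """Count the GPU ids in a CUDA_VISIBLE_DEVICES-style string."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def check_degrees(world_size, ulysses_degree, ring_degree, cfg_parallel):
    """Raise ValueError unless the degrees multiply out to world_size.

    Assumption: enabling CFG parallelism doubles the number of GPUs needed.
    """
    cfg_degree = 2 if cfg_parallel else 1
    required = ulysses_degree * ring_degree * cfg_degree
    if required != world_size:
        raise ValueError(
            f"ulysses={ulysses_degree} x ring={ring_degree} x cfg={cfg_degree} "
            f"= {required}, but world_size is {world_size}"
        )
    return required


def valid_configs(world_size):
    """List every (ulysses_degree, ring_degree, cfg_parallel) that fits."""
    configs = []
    for cfg_parallel in (False, True):
        cfg_degree = 2 if cfg_parallel else 1
        if world_size % cfg_degree:
            continue
        per_branch = world_size // cfg_degree
        for ulysses in range(1, per_branch + 1):
            if per_branch % ulysses == 0:
                configs.append((ulysses, per_branch // ulysses, cfg_parallel))
    return configs


world_size = world_size_from_env("0,1,2,3,4,5")
check_degrees(world_size, ulysses_degree=3, ring_degree=2, cfg_parallel=False)
check_degrees(world_size, ulysses_degree=3, ring_degree=1, cfg_parallel=True)
for config in valid_configs(world_size):
    print(config)
```

Which of the valid combinations is fastest depends on the hardware and interconnect, so it is worth timing several of them.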

4. Performance

The latency of mochi-xDiT compared with the original mochi inference (Baseline) is shown below.

[Figure: L40 performance]

We also tried flash_attn 3 with FP8 support on Hopper GPUs. The latency of mochi-xDiT with flash_attn 2 and with flash_attn 3 is compared in the following figure.

[Figure: H20 performance]

References

xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

@article{fang2024xdit,
  title={xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism},
  author={Fang, Jiarui and Pan, Jinzhe and Sun, Xibo and Li, Aoyu and Wang, Jiannan},
  journal={arXiv preprint arXiv:2411.01738},
  year={2024}
}

USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

@article{fang2024unified,
  title={A Unified Sequence Parallelism Approach for Long Context Generative AI},
  author={Fang, Jiarui and Zhao, Shangchun},
  journal={arXiv preprint arXiv:2405.07719},
  year={2024}
}

Unveiling Redundancy in Diffusion Transformers (DiTs): A Systematic Study

@article{sun2024unveiling,
  title={Unveiling Redundancy in Diffusion Transformers (DiTs): A Systematic Study},
  author={Sun, Xibo and Fang, Jiarui and Li, Aoyu and Pan, Jinzhe},
  journal={arXiv preprint arXiv:2411.13588},
  year={2024}
}
