AnimateDiff based on MindSpore

This repository is the MindSpore implementation of AnimateDiff.


  • Text-to-video generation with AnimateDiff v2, supporting 16 frames @512x512 resolution on Ascend 910B, 16 frames @256x256 resolution on GPU 3090
  • MotionLoRA inference
  • Motion Module Training
  • Motion LoRA Training
  • AnimateDiff v3 Inference
  • AnimateDiff v3 Training
  • SDXL support


pip install -r requirements.txt

If the decord package is not available, try pip install eva-decord. On EulerOS, install ffmpeg and decord as follows.

1. install ffmpeg 4, referring to
    wget --no-check-certificate
    tar -xvf ffmpeg-4.0.1.tar.bz2
    mv ffmpeg-4.0.1 ffmpeg
    cd ffmpeg
    ./configure --enable-shared         # --enable-shared is needed for sharing libavcodec with decord
    make -j 64
    make install
2. install decord, referring to
    git clone --recursive
    cd decord
    rm -rf build && mkdir build && cd build
    cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
    make -j 64
    make install
    cd ../python
    python3 install --user

Prepare Model Weights

First, download the pretrained PyTorch weights, referring to torch animatediff checkpoints.

  • Convert SD dreambooth model

To download ToonYou-Beta3 dreambooth model, please refer to this civitai website, or use the following command:

wget -P models/torch_ckpts/ --content-disposition --no-check-certificate

After downloading this dreambooth checkpoint under animatediff/models/torch_ckpts/, convert the dreambooth checkpoint using:

cd ../examples/stable_diffusion_v2
python tools/model_conversion/  --source ../animatediff/models/torch_ckpts/toonyou_beta3.safetensors   --target models/toonyou_beta3.ckpt  --model sdv1  --source_version pt

In addition, please download RealisticVision V5.1 dreambooth checkpoint and convert it similarly.

  • Convert Motion Module
cd ../examples/animatediff/tools
python --src ../torch_ckpts/mm_sd_v15_v2.ckpt --tar ../models/motion_module

To convert the AnimateDiff v3 motion module checkpoint instead:

cd ../examples/animatediff/tools
python -v v3 --src ../torch_ckpts/v3_sd15_mm.ckpt  --tar ../models/motion_module
  • Convert Motion LoRA
cd ../examples/animatediff/tools
python --src ../torch_ckpts/.ckpt --tar ../models/motion_lora
  • Convert Domain Adapter LoRA
cd ../examples/animatediff/tools
python --src ../torch_ckpts/v3_sd15_adapter.ckpt --tar ../models/domain_adapter_lora
  • Convert SparseCtrl Encoder
cd ../examples/animatediff/tools
python --src ../torch_ckpts/v3_sd15_sparsectrl_{}.ckpt --tar ../models/sparsectrl_encoder

The full tree of expected checkpoints is shown below:

├── domain_adapter_lora
│   └── v3_sd15_adapter.ckpt
├── dreambooth_lora
│   ├── realisticVisionV51_v51VAE.ckpt
│   └── toonyou_beta3.ckpt
├── motion_lora
│   └── v2_lora_ZoomIn.ckpt
├── motion_module
│   ├── mm_sd_v15.ckpt
│   ├── mm_sd_v15_v2.ckpt
│   └── v3_sd15_mm.ckpt
├── sparsectrl_encoder
│   ├── v3_sd15_sparsectrl_rgb.ckpt
│   └── v3_sd15_sparsectrl_scribble.ckpt
└── stable_diffusion
    └── sd_v1.5-d0ab7146.ckpt
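
Before running inference, it can help to confirm that every converted checkpoint is where the tree above expects it. The following is a minimal sketch rather than part of this repository: the function name `missing_checkpoints` and the `models` root directory are assumptions, and the file list is copied from the tree.

```python
from pathlib import Path

# Expected layout, copied from the checkpoint tree above
# (paths are relative to the directory that holds the tree).
EXPECTED_CKPTS = {
    "domain_adapter_lora": ["v3_sd15_adapter.ckpt"],
    "dreambooth_lora": ["realisticVisionV51_v51VAE.ckpt", "toonyou_beta3.ckpt"],
    "motion_lora": ["v2_lora_ZoomIn.ckpt"],
    "motion_module": ["mm_sd_v15.ckpt", "mm_sd_v15_v2.ckpt", "v3_sd15_mm.ckpt"],
    "sparsectrl_encoder": [
        "v3_sd15_sparsectrl_rgb.ckpt",
        "v3_sd15_sparsectrl_scribble.ckpt",
    ],
    "stable_diffusion": ["sd_v1.5-d0ab7146.ckpt"],
}


def missing_checkpoints(models_root):
    """Return relative paths of expected checkpoints not found under models_root."""
    root = Path(models_root)
    missing = []
    for folder, names in EXPECTED_CKPTS.items():
        for name in names:
            if not (root / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing


if __name__ == "__main__":
    # "models" is an assumed location; pass the directory that holds the tree.
    for path in missing_checkpoints("models"):
        print("missing:", path)
```

Only the checkpoints needed for the configs you plan to run have to be present, so a non-empty result is not necessarily an error.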

Inference (AnimateDiff v3 and SparseCtrl)

  • Running On Ascend 910*:
# download demo images
bash scripts/

# under general T2V setting
python --config configs/prompts/v3/v3-1-T2V.yaml

# image animation (on RealisticVision)
python --config configs/prompts/v3/v3-2-animation-RealisticVision.yaml

# sketch-to-animation and storyboarding (on RealisticVision)
python --config configs/prompts/v3/v3-3-sketch-RealisticVision.yaml


  • Running on GPU:

Please append --device_target GPU to the end of the commands above.

If you use the checkpoint converted from torch for inference, please also append --vae_fp16=False to the command above.

Inference (AnimateDiff v2)


  • Running On Ascend 910*:
python --config configs/prompts/v2/1-ToonYou.yaml --L 16 --H 512 --W 512

By default, DDIM sampling is used, and the sampling speed is 1.07s/iter.


  • Running on GPU:
python --config configs/prompts/v2/1-ToonYou.yaml --L 16 --H 256 --W 256 --device_target GPU

If you use the checkpoint converted from torch for inference, please also append --vae_fp16=False to the command above.

Motion LoRA

  • Running On Ascend 910*:
python --config configs/prompts/v2/1-ToonYou-MotionLoRA.yaml --L 16 --H 512 --W 512

By default, DDIM sampling is used, and the sampling speed is 1.07s/iter.

Results using Zoom-In motion lora:

  • Running on GPU:
python --config configs/prompts/v2/1-ToonYou-MotionLoRA.yaml --L 16 --H 256 --W 256 --device_target GPU


Image Finetuning

python --config configs/training/image_finetune.yaml

For 910B, please set export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE" before running training.
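
If training is launched from a Python wrapper instead of a shell, the same variable can be set on the environment passed to the child process. This is a minimal sketch: the helper name `ascend_910b_env` is hypothetical and not part of this repository, and the training command itself is left to the caller.

```python
import os


def ascend_910b_env(base_env=None):
    """Return a copy of the environment with the 910B overflow mode set.

    Hypothetical helper: it only mirrors
    `export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"` from the text above.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MS_ASCEND_CHECK_OVERFLOW_MODE"] = "INFNAN_MODE"
    return env


if __name__ == "__main__":
    env = ascend_910b_env()
    # Pass `env=env` to subprocess.run(...) together with your training command.
    print(env["MS_ASCEND_CHECK_OVERFLOW_MODE"])
```

Copying the environment instead of editing `os.environ` in place keeps the setting scoped to the training process.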

Motion Module Training

python --config configs/training/mmv2_train.yaml

For 910B, please set export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE" before running training.

You can change arguments such as the data path, output directory, and learning rate in the yaml config file. You can also override them with command line arguments, referring to or python --help
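
The precedence described above (values from the yaml file, overridden by any command line argument that is given) can be sketched as follows. This illustrates the pattern only and is not the repository's actual argument parser: `merge_config` is a hypothetical name, the yaml contents are represented by a plain dict, and only a few of the argument names that appear in this README are included.

```python
import argparse


def merge_config(yaml_cfg, argv):
    """Overlay explicitly passed command line arguments on yaml defaults."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not passed" apart from a real value.
    parser.add_argument("--data_path", type=str, default=None)
    parser.add_argument("--image_size", type=int, default=None)
    parser.add_argument("--num_frames", type=int, default=None)
    parser.add_argument("--train_batch_size", type=int, default=None)
    args = parser.parse_args(argv)

    merged = dict(yaml_cfg)  # start from the yaml values
    for key, value in vars(args).items():
        if value is not None:  # only explicitly passed arguments override
            merged[key] = value
    return merged


if __name__ == "__main__":
    yaml_cfg = {"image_size": 512, "num_frames": 16, "train_batch_size": 1}
    print(merge_config(yaml_cfg, ["--image_size", "256", "--num_frames=4"]))
```

Both `--name value` and `--name=value` forms are accepted by argparse, which matches the mixed usage in the commands in this README.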

  • Evaluation

To infer with the trained model, run

python --config configs/prompts/v2/base_video.yaml \
    --motion_module_path {path to saved checkpoint} \
    --prompt  {text prompt}

You can also create a new config yaml based on configs/prompts/v2/base_video.yaml to specify the test prompts and the motion module path.

Here are some generation results after MM training on 512x512 resolution and 16-frame data.

  • Disco light leaks disco ball light reflections shaped rectangular and line with motion blur effect
  • Cloudy moscow kremlin time lapse
  • Sharp knife to cut delicious smoked fish
  • A baker turns freshly baked loaves of sourdough bread

Motion LoRA Training

python --config configs/training/mmv2_lora.yaml

For 910B, please set export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE" before running training.

  • Evaluation

To infer with the trained model, run

python --config configs/prompts/v2/base_video.yaml \
    --motion_lora_path {path to saved checkpoint} \
    --prompt  {text prompt}

Here are some generation results after lora fine-tuning on 512x512 resolution and 16-frame data.

  • Disco light leaks disco ball light reflections shaped rectangular and line with motion blur effect
  • Cloudy moscow kremlin time lapse
  • Sharp knife to cut delicious smoked fish
  • A baker turns freshly baked loaves of sourdough bread

Training on GPU

Please add --device_target GPU in the above training commands and adjust image_size/num_frames/train_batch_size to fit your device memory. Below is an example for 3090.

# reduce the number of frames and the batch size to avoid OOM on a 3090
python --config configs/training/mmv2_train.yaml --data_path ../videocomposer/datasets/webvid5 --image_size 256 --num_frames=4 --device_target GPU --train_batch_size=1
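
As a rough guide to how much the settings above shrink the workload, the number of pixels per training step scales with batch size × frames × height × width. The sketch below compares the 3090 example with the 512x512, 16-frame, batch size 1 setting used in the performance table. It is only a proxy: actual memory use depends on the model and framework and does not scale exactly with this number. The function name is illustrative.

```python
def pixels_per_batch(image_size, num_frames, train_batch_size):
    """Pixels processed per training step, assuming square frames."""
    return train_batch_size * num_frames * image_size * image_size


# Reference setting from the performance table vs. the 3090 example above.
reference = pixels_per_batch(image_size=512, num_frames=16, train_batch_size=1)
reduced = pixels_per_batch(image_size=256, num_frames=4, train_batch_size=1)

print(f"reduction factor: {reference / reduced:.0f}x")
```

Halving the resolution cuts the pixel count by 4 and going from 16 to 4 frames cuts it by another 4, giving a 16x reduction overall.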



Model           Context           Scheduler  Steps  Resolution  Frames  Speed (step/s)  Time (s/video)
AnimateDiff v2  D910*x1-MS2.2.10  DDIM       30     512x512     16      1.2             25

Context: {Ascend chip}-{number of NPUs}-{mindspore version}.
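
The Speed and Time figures above are related by simple arithmetic: time per video is the number of sampling steps divided by the sampling speed. A minimal check, assuming the reported time covers only the sampling loop (the function name is illustrative):

```python
def seconds_per_video(steps, steps_per_second):
    """Sampling time for one video, ignoring any setup or decoding cost."""
    return steps / steps_per_second


# Values from the inference performance table above.
print(f"{seconds_per_video(30, 1.2):.0f} s/video")
```

The same function can be used to estimate the sampling time for a different number of steps at the measured speed.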


Model           Context           Task                          Local BS x Grad. Accu.  Resolution  Frames  Step T. (s/step)
AnimateDiff v2  D910*x1-MS2.2.10  MM training                   1x1                     512x512     16      1.29
AnimateDiff v2  D910*x1-MS2.2.10  Motion Lora                   1x1                     512x512     16      1.26
AnimateDiff v2  D910*x1-MS2.2.10  MM training w/ Embed. cached  1x1                     512x512     16      0.75
AnimateDiff v2  D910*x1-MS2.2.10  Motion Lora w/ Embed. cached  1x1                     512x512     16      0.71

Context: {Ascend chip}-{number of NPUs}-{mindspore version}.

MM training: Motion Module training

Embed. cached: The video embedding (VAE-encoder outputs) and text embedding are pre-computed and stored before diffusion training.
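
The idea behind embedding caching can be sketched as follows: each sample goes through the frozen encoders once, the outputs are written to disk, and training then reads the stored embeddings instead of re-encoding every step. This is a simplified illustration, not the repository's implementation. The function names, the pickle format, and the toy encoders are all assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def build_embedding_cache(samples, encode_video, encode_text, cache_dir):
    """Run both encoders once per (video, caption) sample and store the outputs."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for idx, (video, caption) in enumerate(samples):
        record = {
            "video_emb": encode_video(video),  # stands in for the VAE encoder
            "text_emb": encode_text(caption),  # stands in for the text encoder
        }
        path = cache_dir / f"{idx:06d}.pkl"
        with open(path, "wb") as f:
            pickle.dump(record, f)
        paths.append(path)
    return paths


def load_cached(path):
    """Load one cached record; a training loop would call this per sample."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Toy stand-ins for the real encoders, for illustration only.
    toy_video_encoder = lambda frames: [x * 0.5 for x in frames]
    toy_text_encoder = lambda text: [len(word) for word in text.split()]

    samples = [([1, 2, 3], "cloudy time lapse"), ([4, 5], "smoked fish")]
    with tempfile.TemporaryDirectory() as tmp:
        paths = build_embedding_cache(samples, toy_video_encoder, toy_text_encoder, tmp)
        print(load_cached(paths[0]))
```

Because the encoders are not run during training, each step only pays for the diffusion model, which is consistent with the lower step times reported for the cached rows (for example 0.75 s/step versus 1.29 s/step for MM training).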