This is the official implementation of DRA-Ctrl.
by Hengyuan Cao, Yutong Feng, Biao Gong, Yijing Tian, Yunhong Lu, Chuang Liu, and Bin Wang
[2025-07-10] Models are now offloaded to the CPU while idle to reduce GPU memory usage, and quantization can be applied to cut memory requirements further, so our work should now run on consumer-grade GPUs. For usage details, please refer to the Get Started section below. Please note: you need to check requirements.txt and update your environment dependencies, as this ensures the new features will function properly.
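Conceptually, the idle-time offload works like the following minimal PyTorch sketch. This is an illustration of the idea only, not the project's actual implementation; the function name is ours.

```python
import torch

def forward_with_offload(module: torch.nn.Module, *inputs: torch.Tensor) -> torch.Tensor:
    """Keep a module on the CPU and borrow the GPU only for its forward pass."""
    module.to("cuda")                              # load weights onto the GPU only when needed
    gpu_inputs = [x.to("cuda") for x in inputs]    # inputs must live on the same device
    with torch.no_grad():
        out = module(*gpu_inputs)
    module.to("cpu")                               # park weights on the CPU while idle
    torch.cuda.empty_cache()                       # release cached VRAM back to the driver
    return out
```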
[2025-07-01] Added a new Gradio app (gradio_app_hf.py), modeled on our HuggingFace Space, that makes it easier to switch tasks, adjust parameters, and test the examples directly. The previous Gradio app (gradio_app.py) remains unchanged.
- Release code
- Release checkpoints
- Use quantized version to save VRAM
- Use FramePack as the base model
Video generative models can be regarded as world simulators due to their ability to capture dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed Dimension-Reduction Attack (DRA-Ctrl), which leverages the strengths of video models, including long-range context modeling and flattened full attention, to perform various generation tasks. Specifically, to bridge the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays a foundation for future unified generative models across visual modalities.
Our method is implemented on Linux with an H800 80GB GPU; peak VRAM consumption stays below 45GB.
conda create --name dra_ctrl python=3.12
conda activate dra_ctrl
pip install -r requirements.txt
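To quickly verify that the environment can see your GPU (an optional sanity check, not part of the official setup):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"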
We use the community fork of the Diffusers-format weights of tencent/HunyuanVideo-I2V as the model's initialization parameters.
You can download the LoRA weights for various tasks of DRA-Ctrl at this link.
The checkpoint directory is shown below.
DRA-Ctrl/
└── ckpts/
    ├── HunyuanVideo-I2V/
    │   ├── image_processor/
    │   ├── scheduler/
    │   ...
    ├── depth-anything-small-hf/
    │   ├── model.safetensors
    │   ...
    ├── canny.safetensors
    ├── coloring.safetensors
    ├── deblurring.safetensors
    ├── depth.safetensors
    ├── depth_pred.safetensors
    ├── fill.safetensors
    ├── sr.safetensors
    ├── subject_driven.safetensors
    └── style_transfer.safetensors
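As an illustration, the base weights can be fetched into the layout above with huggingface_hub. The repo ID below is our assumption for the community Diffusers-format fork; check it against the links above before use.

```python
# Hypothetical download sketch; confirm the repo ID against the links above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hunyuanvideo-community/HunyuanVideo-I2V",  # assumed community fork ID
    local_dir="ckpts/HunyuanVideo-I2V",                 # matches the tree above
)
```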
To reduce GPU memory requirements, we provide a vram_optimization parameter that selects among the following memory optimization schemes:
- No_Optimization: no optimization is applied; 48GB of VRAM is sufficient to run the code.
- HighRAM_HighVRAM: no more than 20GB of VRAM is required.
- HighRAM_LowVRAM: no more than 8GB of VRAM is required.
- LowRAM_HighVRAM: no more than 20GB of VRAM is required.
- LowRAM_LowVRAM: no more than 8GB of VRAM is required.
- VerylowRAM_LowVRAM: no more than 8GB of VRAM is required.
Note: Reduced resources will lead to increased generation time.
python gradio_app_hf.py --vram_optimization SET_YOUR_OPTIMIZATION_SCHEME_HERE
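For example, to run the app on a consumer GPU with limited VRAM but ample system RAM, one might choose:

python gradio_app_hf.py --vram_optimization HighRAM_LowVRAM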
The command below runs the legacy Gradio app, which we do not recommend. For easier task switching, parameter adjustment, example testing, and better VRAM optimization, use the command above.
python gradio_app.py --config configs/gradio.yaml
In spatially-aligned image generation tasks, when passing the condition image to gradio_app, there is no need to manually provide edge maps, depth maps, or other condition images; only the original image is required, and the corresponding condition images are extracted automatically.
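For illustration, a depth condition can be extracted with the bundled Depth-Anything checkpoint roughly as follows. This is a sketch of the idea only (the app performs the extraction internally), and the asset filename is hypothetical.

```python
from PIL import Image
from transformers import pipeline

# Depth estimator backed by the local checkpoint from the tree above.
depth_estimator = pipeline("depth-estimation", model="ckpts/depth-anything-small-hf")

image = Image.open("assets/depth_test.jpg")   # hypothetical asset name
depth_map = depth_estimator(image)["depth"]   # PIL image of the predicted depth
depth_map.save("depth_condition.png")
```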
You can use the *_test.jpg or *_test.png images from the assets folder as condition images for input to gradio_app, which will generate the following examples:
Examples:
If you find our work helpful, please cite:
@misc{cao2025dimensionreductionattackvideogenerative,
  title={Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis},
  author={Hengyuan Cao and Yutong Feng and Biao Gong and Yijing Tian and Yunhong Lu and Chuang Liu and Bin Wang},
  year={2025},
  eprint={2505.23325},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.23325},
}
This project uses code from the following sources:
- diffusers/models/transformers/transformer_hunyuan_video - Copyright 2024 The HunyuanVideo Team and The HuggingFace Team (Apache 2.0 licensed).
- diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video - Copyright 2024 The HunyuanVideo Team and The HuggingFace Team (Apache 2.0 licensed).
We would like to thank the contributors to the HunyuanVideo, HunyuanVideo-I2V, diffusers, and HuggingFace repositories for their open research and exploration.