StableVideo: Text-driven Consistency-aware Diffusion Video Editing
Wenhao Chai, Xun Guo✉️, Gaoang Wang, Yan Lu
ICCV 2023
Example results: `boat.mp4`, `car.mp4`, `blackswan.mp4`
| Setting | VRAM (MiB) |
|---|---|
| float32 | 29145 |
| amp | 23005 |
| amp + cpu | 17639 |
| amp + cpu + xformers | 14185 |
- `amp`: automatic mixed precision
- `cpu`: cache idle models on the CPU, enabled with the `save_memory` argument
- measured under the default settings (e.g. resolution) in `app.py`
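As a rough illustration only (not StableVideo's actual implementation), the options in the table correspond to standard PyTorch mechanisms. The sketch below uses a hypothetical stand-in module; the real toggles live in `app.py` (e.g. the `save_memory` argument).

```python
# Illustrative sketch: how "amp" and "cpu" roughly map to PyTorch mechanisms.
# This is NOT StableVideo's code; the module below is a stand-in.
import torch

use_amp = True          # "amp": run forward passes under automatic mixed precision
offload_to_cpu = True   # "cpu": cache idle sub-models on the CPU to free VRAM

module = torch.nn.Linear(64, 64).cuda()   # stand-in for a diffusion sub-module
x = torch.randn(1, 64, device="cuda")

with torch.autocast("cuda", enabled=use_amp):
    y = module(x)                          # lower-precision matmuls when AMP is on

if offload_to_cpu:
    module.to("cpu")                       # move back with module.cuda() when needed again
```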
```bash
git clone https://github.com/rese1f/StableVideo.git
cd StableVideo
conda create -n stablevideo python=3.11
conda activate stablevideo
pip install -r requirements.txt
pip install xformers   # optional, enables memory-efficient attention
```
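After installation, a quick sanity check (a generic snippet, not part of the repo) can confirm that PyTorch sees the GPU and whether the optional xformers package is available:

```python
# Generic environment check; not part of the StableVideo codebase.
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import xformers
    print("xformers", xformers.__version__)
except ImportError:
    print("xformers not installed (optional)")
```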
(Optional) We also provide a CPU-only Hugging Face demo:

```bash
git lfs install
git clone https://huggingface.co/spaces/Reself/StableVideo
cd StableVideo
pip install -r requirements.txt
```
All models and detectors can be downloaded from the ControlNet Hugging Face page at Download Link.
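If you prefer to script the download, here is a minimal sketch using `huggingface_hub`. The repo id and file paths are assumptions based on the public `lllyasviel/ControlNet` repository and are not confirmed by this README; the Download Link above is authoritative.

```python
# Hedged sketch: fetch the ControlNet weights and copy them into ./ckpt.
# Repo id and file paths are assumptions; verify against the Download Link.
import os
import shutil
from huggingface_hub import hf_hub_download

os.makedirs("ckpt", exist_ok=True)
for name in [
    "models/control_sd15_canny.pth",
    "models/control_sd15_depth.pth",
    "annotator/ckpts/dpt_hybrid-midas-501f0c75.pt",
]:
    cached = hf_hub_download(repo_id="lllyasviel/ControlNet", filename=name)
    shutil.copy(cached, os.path.join("ckpt", os.path.basename(name)))
```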
Download the example atlases for car-turn, boat, libby, blackswan, bear, bicycle_tali, giraffe, kite-surf, lucia, and motorbike at Download Link, shared by the Text2LIVE authors.
You can also train on your own video following NLA.
After downloading, the project folder should contain a `data` directory and look like this:
```
StableVideo
├── ...
├── ckpt
│   ├── cldm_v15.yaml
│   ├── dpt_hybrid-midas-501f0c75.pt
│   ├── control_sd15_canny.pth
│   └── control_sd15_depth.pth
├── data
│   ├── car-turn
│   │   ├── checkpoint   # NLA models are stored here
│   │   ├── car-turn     # contains video frames
│   │   └── ...
│   ├── blackswan
│   ├── ...
└── ...
```
Run the following command to start:

```bash
python app.py
```

The resulting `.mp4` video and keyframe will be stored in `./log` after clicking the `render` button.
You can also edit the mask region for the foreground atlas as follows. There is currently a possible bug in Gradio: please check carefully whether the editable output foreground atlas block looks the same as the one above. If it does not, restart the entire program.
If our work is useful for your research, please consider citing as below. Many thanks :)
```bibtex
@article{chai2023stablevideo,
  title={StableVideo: Text-driven Consistency-aware Diffusion Video Editing},
  author={Chai, Wenhao and Guo, Xun and Wang, Gaoang and Lu, Yan},
  journal={arXiv preprint arXiv:2308.09592},
  year={2023}
}
```
This implementation is built partly on Text2LIVE and ControlNet.