StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

Sijie Zhao*  Wenbo Hu*  Xiaodong Cun*  Yong Zhang†  Xiaoyu Li† 
Zhe Kong  Xiangjun Gao  Muyao Niu  Ying Shan

 * equal contribution   † corresponding author

Tencent AI Lab  ARC Lab, Tencent PCG

Project Page

💡 Abstract

We propose a novel framework that converts any 2D video into an immersive stereoscopic 3D video that can be viewed on different display devices, such as 3D glasses, Apple Vision Pro, and 3D displays. It can be applied to various video sources, such as movies, vlogs, 3D cartoons, and AIGC videos.

📣 News

  • 2024/12/27 We released our inference code and model weights.
  • 2024/09/11 We submitted our technical report on arXiv and released our project page.

🎞️ Showcases

Here we show some examples of input videos and their corresponding stereo outputs in Anaglyph 3D format.

🛠️ Installation

1. Set up the environment

We run our code on Python 3.8 and CUDA 11.8. You can use Anaconda or Docker to build this basic environment.

2. Clone the repo

# use --recursive to clone the dependent submodules
git clone --recursive https://github.com/TencentARC/StereoCrafter
cd StereoCrafter

3. Install the requirements

pip install -r requirements.txt

4. Install the customized 'Forward-Warp' package for forward splatting

cd ./dependency/Forward-Warp
chmod a+x install.sh
./install.sh

📦 Model Weights

1. Download the SVD img2vid model for the image encoder and VAE.

# in StereoCrafter project root directory
mkdir weights
cd ./weights
git lfs install
git clone https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt-1-1

2. Download the DepthCrafter model for video depth estimation.

git clone https://huggingface.co/tencent/DepthCrafter

3. Download the StereoCrafter model for stereo video generation.

git clone https://huggingface.co/TencentARC/StereoCrafter
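
Before running inference, it can help to confirm that all three clones landed under `./weights`. The following is a minimal sketch, not part of the repository: the helper name `missing_weight_dirs` is hypothetical, and the directory names are assumed to be the defaults that `git clone` creates for the three URLs above. It only checks that the directories exist, not that the Git LFS files inside them were fully downloaded.

```python
from pathlib import Path

# Directory names that the three `git clone` commands above create by default.
EXPECTED_WEIGHT_DIRS = (
    "stable-video-diffusion-img2vid-xt-1-1",
    "DepthCrafter",
    "StereoCrafter",
)


def missing_weight_dirs(weights_root="./weights"):
    """Return the expected model directories that are absent under weights_root."""
    root = Path(weights_root)
    return [name for name in EXPECTED_WEIGHT_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_weight_dirs()
    if missing:
        print("Missing model directories:", ", ".join(missing))
    else:
        print("All three model directories are present.")
```

Run it from the StereoCrafter project root so that the relative `./weights` path resolves correctly.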

🔄 Inference

Script:

# in StereoCrafter project root directory
sh run_inference.sh

There are two main steps in this script for generating stereo video.

1. Depth-Based Video Splatting Using the Video Depth from DepthCrafter

Execute the following command:

python depth_splatting_inference.py --pre_trained_path [PATH] --unet_path [PATH] \
                                    --input_video_path [PATH] --output_video_path [PATH]

Arguments:

  • --pre_trained_path: Path to the SVD img2vid model weights (e.g., ./weights/stable-video-diffusion-img2vid-xt-1-1).
  • --unet_path: Path to the DepthCrafter model weights (e.g., ./weights/DepthCrafter).
  • --input_video_path: Path to the input video (e.g., ./source_video/camel.mp4).
  • --output_video_path: Path to the output video (e.g., ./outputs/camel_splatting_results.mp4).
  • --max_disp: Parameter controlling the maximum disparity between the generated right video and the input left video. Default value is 20 pixels.
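
If you want to launch this step from Python, for example to process several videos in a loop, the documented flags can be assembled into a command list. This is a sketch under assumptions: the helper names `build_splatting_command` and `run_splatting` are hypothetical, the weight paths follow the example values listed above, and `--max_disp` is passed as the documented default of 20.

```python
import subprocess


def build_splatting_command(input_video, output_video, max_disp=20,
                            weights_root="./weights"):
    """Assemble the stage-1 command line from the flags documented above."""
    return [
        "python", "depth_splatting_inference.py",
        "--pre_trained_path", f"{weights_root}/stable-video-diffusion-img2vid-xt-1-1",
        "--unet_path", f"{weights_root}/DepthCrafter",
        "--input_video_path", input_video,
        "--output_video_path", output_video,
        "--max_disp", str(max_disp),
    ]


def run_splatting(input_video, output_video, **kwargs):
    """Run stage 1 from the project root; raises CalledProcessError on failure."""
    command = build_splatting_command(input_video, output_video, **kwargs)
    subprocess.run(command, check=True)
```

For instance, `run_splatting("./source_video/camel.mp4", "./outputs/camel_splatting_results.mp4")` would reproduce the example paths from the argument list. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.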

The first step generates a video grid containing the input video, the visualized depth map, the occlusion mask, and the splatted right video, as shown below:

camel_splatting_results

2. Stereo Video Inpainting of the Splatting Video

Execute the following command:

python inpainting_inference.py --pre_trained_path [PATH] --unet_path [PATH] \
                               --input_video_path [PATH] --save_dir [PATH]

Arguments:

  • --pre_trained_path: Path to the SVD img2vid model weights (e.g., ./weights/stable-video-diffusion-img2vid-xt-1-1).
  • --unet_path: Path to the StereoCrafter model weights (e.g., ./weights/StereoCrafter).
  • --input_video_path: Path to the splatting video result generated by the first stage (e.g., ./outputs/camel_splatting_results.mp4).
  • --save_dir: Directory for the output stereo video (e.g., ./outputs).
  • --tile_num: The number of tiles along each of the width and height dimensions for tiled processing, which allows high-resolution input to be handled without requiring more GPU memory. The default value is 1 (1 × 1 tile). For input videos with a resolution of 2K or higher, you can use more tiles to avoid running out of memory.
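
To illustrate what `--tile_num` means geometrically, the sketch below splits a frame into `tile_num` × `tile_num` boxes. It is a simplified illustration, not the repository's implementation: the function name `tile_boxes` is hypothetical, and the tiles here do not overlap, whereas a real tiled pipeline may overlap and blend neighbouring tiles to hide seams.

```python
def tile_boxes(width, height, tile_num):
    """Split a width x height frame into tile_num x tile_num non-overlapping boxes.

    Returns (left, top, right, bottom) tuples in row-major order.
    """
    if tile_num < 1:
        raise ValueError("tile_num must be at least 1")
    # Integer boundaries; the last boundary always equals the full width/height.
    xs = [i * width // tile_num for i in range(tile_num + 1)]
    ys = [i * height // tile_num for i in range(tile_num + 1)]
    return [
        (xs[col], ys[row], xs[col + 1], ys[row + 1])
        for row in range(tile_num)
        for col in range(tile_num)
    ]


if __name__ == "__main__":
    for box in tile_boxes(2048, 1024, 2):
        print(box)
```

With `tile_num=2`, each box covers about a quarter of the frame, so the area processed at one time shrinks accordingly, which is why more tiles help with high-resolution input.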

The stereo video inpainting step outputs the stereo video in both side-by-side and anaglyph 3D formats, as shown below:

camel_sbs

camel_anaglyph
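
The two output formats can be understood as simple per-frame compositions of the left and right views. The sketch below is illustrative only and does not reflect the repository's code: the function names are hypothetical, frames are represented as plain lists of rows of `(r, g, b)` tuples, the left view is assumed to go on the left half of the side-by-side frame, and the anaglyph uses the common red-cyan convention (red from the left view, green and blue from the right view).

```python
def side_by_side(left, right):
    """Concatenate each row of the left and right frames horizontally."""
    return [left_row + right_row for left_row, right_row in zip(left, right)]


def red_cyan_anaglyph(left, right):
    """Take the red channel from the left view and green/blue from the right view."""
    return [
        [(lp[0], rp[1], rp[2]) for lp, rp in zip(left_row, right_row)]
        for left_row, right_row in zip(left, right)
    ]


if __name__ == "__main__":
    # A tiny 1 x 2 frame pair; each pixel is an (r, g, b) tuple.
    left = [[(10, 20, 30), (40, 50, 60)]]
    right = [[(1, 2, 3), (4, 5, 6)]]
    print(side_by_side(left, right))
    print(red_cyan_anaglyph(left, right))
```

The side-by-side frame is twice as wide as either input, while the anaglyph keeps the original resolution and relies on red-cyan glasses to separate the two views.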

🤝 Acknowledgements

We would like to express our gratitude to the following open-source projects:

  • Stable Video Diffusion: A latent diffusion model trained to generate video clips from an image or text conditioning.
  • DepthCrafter: A novel method to generate temporally consistent depth sequences from videos.

📚 Citation

@article{zhao2024stereocrafter,
  title={Stereocrafter: Diffusion-based generation of long and high-fidelity stereoscopic 3d from monocular videos},
  author={Zhao, Sijie and Hu, Wenbo and Cun, Xiaodong and Zhang, Yong and Li, Xiaoyu and Kong, Zhe and Gao, Xiangjun and Niu, Muyao and Shan, Ying},
  journal={arXiv preprint arXiv:2409.07447},
  year={2024}
}