Authors:
Yibin Yan*, BUPT
Yiqin Wang*, Tsinghua University
Yansong Tang, Tsinghua-Berkeley Shenzhen Institute
(* = equal contribution, names listed alphabetically)
This work was done during Yibin and Yiqin's internship with Prof. Tang.
The demo is currently paused. If you would like to try it out, you can reach out to Yiqin at [email protected].
- ChatVID combines the knowledge of Large Language Models with the sensing ability of Vision Models and Audio Models.
- ChatVID demonstrates a powerful capability to talk about anything in a video.
- Please give us a star! For any questions or suggestions, feel free to drop Yiqin an email at [email protected] or open an issue.
- Leverage the power of Large Language Models, Vision Models, and Audio Models to enable conversations about videos (a minimal sketch of the idea follows this list).
- Utilize Vicuna as the Large Language Model for understanding user queries and generating responses.
- Incorporate state-of-the-art Vision Models such as BLIP-2, GRiT, and Vid2Seq for visual understanding and analysis.
- Employ Whisper as the Audio Model to process audio content within videos.
- Enable users to have conversations and discussions about any aspect of a video.
- Enhance the overall video-watching experience by providing an interactive and engaging platform.
- ChatVID with Vicuna-7B (8-bit) can run on an NVIDIA GPU with 24 GB of GPU memory and 8 GB of CPU RAM.
- ChatVID needs an extra 10 GB of CPU RAM when using Vid2Seq.
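To make the pipeline concrete, here is a minimal, hypothetical sketch of the idea: timestamped captions from the vision models and a transcript from Whisper are fused into a single text prompt, which the frozen Vicuna model then answers. This is an illustration only, not ChatVID's actual code; the prompt format and toy inputs are assumptions.

```python
# Illustrative sketch only (not ChatVID's actual code): fuse visual captions
# and the audio transcript into one prompt for the LLM.

def build_video_prompt(frame_captions, transcript, question):
    """Assemble timestamped frame captions and the ASR transcript into an LLM prompt."""
    visual_context = "\n".join(f"[{t:.0f}s] {c}" for t, c in frame_captions)
    return (
        "You are watching a video.\n"
        f"Visual descriptions:\n{visual_context}\n"
        f"Audio transcript:\n{transcript}\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    # Toy stand-ins for BLIP-2/GRiT/Vid2Seq captions and Whisper output.
    captions = [(0, "a person opens a laptop"), (5, "code appears on the screen")]
    transcript = "Today I will show you how to set up the project."
    print(build_video_prompt(captions, transcript, "What is the video about?"))
```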
```
pip install -r pre-requirements.txt
pip install -r requirements.txt
pip install -r extra-requirements.txt  # optional, only for Vid2Seq
```
You will also need to install ffmpeg for Whisper. Note that if Whisper encounters permission errors, you may need to set the environment variable `DATA_GYM_CACHE_DIR` to a writable cache directory, e.g. `DATA_GYM_CACHE_DIR='/YourRootDir/ChatVID/.cache'`.
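For reference, here is a minimal sketch of that workaround, assuming you call Whisper through the openai-whisper Python package; the video filename is just a placeholder.

```python
# Minimal sketch: point the cache at a writable directory before Whisper loads its tokenizer.
import os
os.environ["DATA_GYM_CACHE_DIR"] = "/YourRootDir/ChatVID/.cache"  # must be writable

import whisper

model = whisper.load_model("base")        # any Whisper model size works
result = model.transcribe("example.mp4")  # requires ffmpeg on PATH
print(result["text"])
```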
Download the GRiT checkpoint (`grit_b_densecap_objectdet.pth`) and put it into the `pretrained_models` folder.
ChatVID uses frozen Vicuna 7B and 13B models. Please first follow the instructions to prepare the Vicuna v1.1 weights. Then set `vicuna.model_path` in the Infer Config (`config/infer.yaml`) to the folder that contains the Vicuna weights.
- Prepare the CLIP ViT-L/14 checkpoint for feature extraction in Vid2Seq. Get the CLIP ViT-L/14 checkpoint and set `vid2seq.clip_path` in the Infer Config to the checkpoint path. `vid2seq.output_path` is used to store the generated TFRecords and can be set to any writable directory. `vid2seq.work_dir` is Flax's working directory and can also be set to any writable directory.
- Prepare the Vid2Seq ActivityNet checkpoint. Get the Vid2Seq ActivityNet checkpoint and rename it to `checkpoint_200001`. Then set `vid2seq.checkpoint_path` in the Infer Config to the folder that contains the checkpoint. (A quick sanity check for these paths is sketched right after this list.)
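Before launching the app, it can help to verify that every path in the Infer Config actually exists. The sketch below assumes `config/infer.yaml` is a plain YAML file with the nested keys named above; adjust the key names if your config layout differs.

```python
# Sanity-check sketch for config/infer.yaml (key layout is assumed, not guaranteed).
import os
import yaml  # pip install pyyaml

with open("config/infer.yaml") as f:
    cfg = yaml.safe_load(f)

paths = {
    "vicuna.model_path": cfg.get("vicuna", {}).get("model_path"),
    "vid2seq.clip_path": cfg.get("vid2seq", {}).get("clip_path"),
    "vid2seq.checkpoint_path": cfg.get("vid2seq", {}).get("checkpoint_path"),
    "vid2seq.output_path": cfg.get("vid2seq", {}).get("output_path"),  # just needs to be writable
    "vid2seq.work_dir": cfg.get("vid2seq", {}).get("work_dir"),        # just needs to be writable
}

for key, path in paths.items():
    status = "OK" if path and os.path.exists(path) else "CHECK"
    print(f"{status:5s} {key}: {path}")
```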
```
ChatVID/
|__config/
|  |__...
|__model/
|  |__...
|__scenic/
|  |__...
|__simclr/
|  |__...
|__pretrained_models/
|  |__grit_b_densecap_objectdet.pth
|  |__vicuna-7b/
|  |  |__pytorch_model-00001-of-00002.bin
|  |  |__pytorch_model-00002-of-00002.bin
|  |  |__...
|  |__vid2seq_ckpt/
|  |  |__checkpoint_200001
|  |__clip_ckpt/
|     |__ViT-L-14.pt
|__app.py
|__README.md
|__pre-requirements.txt
|__requirements.txt
|__extra-requirements.txt
|__LICENSE
```
```
# change all the absolute paths in config/infer.yaml
python app.py
```
This work is based on Vicuna, BLIP-2, GRiT, Vid2Seq, and Whisper. Thanks for their great work!