InvincibleWyq/ChatVID

ChatVID

💬 Chat about anything on any video! 🎥

Authors:
Yibin Yan🤝, BUPT    Yiqin Wang🤝, Tsinghua University
Yansong Tang, Tsinghua-Berkeley Shenzhen Institute
(🤝 = equal contribution, names listed alphabetically)
This work was done during Yibin and Yiqin's internship with Prof. Tang.

Try our demo: 🤗 Hugging Face Spaces

The demo is currently paused. If you would like to try it, you can reach out to Yiqin at [email protected]

Intro to ChatVID

⭐ ChatVID combines the knowledge of Large Language Models with the sensing ability of Vision Models and Audio Models.

⭐ ChatVID demonstrates a powerful capability to talk about anything in a video.

⭐ Please give us a Star! For any questions or suggestions, feel free to drop Yiqin an email at [email protected] or open an issue.

Highlights 🔥

  • 🔍 Leverage the power of Large Language Models, Vision Models, and Audio Models to enable conversations about videos.
  • 🤖 Utilize Vicuna as the Large Language Model for understanding user queries and generating responses.
  • 📷 Incorporate state-of-the-art Vision Models like BLIP-2, GRiT, and Vid2Seq for visual understanding and analysis.
  • 🎤 Employ Whisper as an Audio Model to process audio content within videos.
  • 💬 Enable users to have conversations and discussions about any aspect of a video.
  • 🚀 Enhance the overall video-watching experience by providing an interactive and engaging platform.
  • 🚗 ChatVID with Vicuna-7B (8-bit) runs on an NVIDIA GPU with 24 GB of memory and 8 GB of CPU RAM.
  • 🎥 ChatVID needs an extra 10 GB of CPU RAM when using Vid2Seq.

Gradio Example ✨

[image: The Temple Of Heaven]

[image: Cook]

Install Instructions 💻

pip install -r pre-requirements.txt
pip install -r requirements.txt
pip install -r extra-requirements.txt # optional, only for vid2seq

You will also need to install ffmpeg for Whisper. If Whisper encounters permission errors, set the environment variable DATA_GYM_CACHE_DIR to a writable cache directory, for example DATA_GYM_CACHE_DIR='/YourRootDir/ChatVID/.cache'.
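A minimal Python sketch of the two checks above, meant to run before Whisper is loaded. The helper name `prepare_whisper_env` is hypothetical and not part of ChatVID. Only the ffmpeg requirement and the `DATA_GYM_CACHE_DIR` variable come from the text.

```python
import os
import shutil
from pathlib import Path


def prepare_whisper_env(cache_dir):
    """Point DATA_GYM_CACHE_DIR at a writable folder and report whether ffmpeg is on PATH.

    Hypothetical helper for illustration only; it is not part of ChatVID.
    """
    cache = Path(cache_dir).expanduser().resolve()
    cache.mkdir(parents=True, exist_ok=True)  # create the cache folder if it is missing
    if not os.access(cache, os.W_OK):
        raise PermissionError(f"cache directory is not writable: {cache}")
    # Only affects the current process and any child processes it starts
    os.environ["DATA_GYM_CACHE_DIR"] = str(cache)
    # True when an ffmpeg executable can be found on PATH
    return shutil.which("ffmpeg") is not None
```

For example, `prepare_whisper_env("/YourRootDir/ChatVID/.cache")` sets the variable for the current process and returns `False` when ffmpeg still needs to be installed.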

Setting Up Checkpoints 📦💼

GRiT Checkpoints 🚀

Put the GRiT checkpoint into the pretrained_models folder.

Vicuna Weights 🦙

ChatVID uses frozen Vicuna 7B and 13B models. Please first follow the instructions to prepare Vicuna v1.1 weights. Then set vicuna.model_path in the Infer Config to the folder that contains the Vicuna weights.

Vid2Seq Checkpoints (Optional) 🎥📊

  1. Prepare the CLIP ViT-L/14 checkpoint used for feature extraction in Vid2Seq. Get the CLIP ViT-L/14 checkpoint and set vid2seq.clip_path in the Infer Config to the checkpoint path. vid2seq.output_path stores the generated TFRecords and can be any writable directory. vid2seq.work_dir is Flax's working directory and can also be any writable directory.

  2. Prepare the Vid2Seq ActivityNet checkpoint. Get the Vid2Seq ActivityNet checkpoint and rename it to checkpoint_200001. Then set vid2seq.checkpoint_path in the Infer Config to the folder that contains the checkpoint.
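The rename in step 2 can be sketched in Python as follows. This assumes the downloaded checkpoint is a single file. The function name `install_vid2seq_checkpoint` and its arguments are hypothetical; only the target name `checkpoint_200001` and the `vid2seq.checkpoint_path` key come from the text.

```python
import shutil
from pathlib import Path


def install_vid2seq_checkpoint(downloaded_file, ckpt_dir):
    """Move a downloaded checkpoint into ckpt_dir under the name checkpoint_200001.

    Returns the folder to use for vid2seq.checkpoint_path.
    Hypothetical helper; assumes the checkpoint is a single file.
    """
    src = Path(downloaded_file)
    if not src.is_file():
        raise FileNotFoundError(f"checkpoint not found: {src}")
    dst_dir = Path(ckpt_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / "checkpoint_200001"
    if dst.exists():
        raise FileExistsError(f"refusing to overwrite {dst}")
    shutil.move(str(src), str(dst))
    # The config key expects the containing folder, not the file itself
    return dst_dir.resolve()
```

The returned folder is the value to put in `vid2seq.checkpoint_path`.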

File Structure

ChatVID/
|__config/
    |__...
|__model/
    |__...
|__scenic/
    |__...
|__simclr/
    |__...
|__pretrained_models/
    |__grit_b_densecap_objectdet.pth
|__vicuna-7b/
    |__pytorch_model-00001-of-00002.bin
    |__pytorch_model-00002-of-00002.bin
    |__...
|__vid2seq_ckpt/
    |__checkpoint_200001
|__clip_ckpt/
    |__ViT-L-14.pt
|__app.py
|__README.md
|__pre-requirements.txt
|__requirements.txt
|__extra-requirements.txt
|__LICENSE
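
As a quick sanity check of the layout above, the following Python sketch lists which downloaded checkpoints are still missing. It only covers the entries a user has to add by hand, with paths copied from the tree. The function name `missing_paths` and the `use_vid2seq` flag are hypothetical.

```python
from pathlib import Path

# Checkpoints the user downloads, with paths as shown in the tree above
CHECKPOINTS = [
    "pretrained_models/grit_b_densecap_objectdet.pth",
    "vicuna-7b",
]
# Only needed when the optional Vid2Seq model is used
VID2SEQ_CHECKPOINTS = [
    "vid2seq_ckpt/checkpoint_200001",
    "clip_ckpt/ViT-L-14.pt",
]


def missing_paths(root, use_vid2seq=False):
    """Return the expected checkpoint paths that do not exist under root."""
    root = Path(root)
    expected = CHECKPOINTS + (VID2SEQ_CHECKPOINTS if use_vid2seq else [])
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_paths("ChatVID", use_vid2seq=True)` returns an empty list once every checkpoint is in place.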

Gradio WebUI Usage 🌐

# first change all the absolute paths in config/infer.yaml to match your setup
python app.py
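
Before launching, it can help to confirm that the configured paths really are absolute. The sketch below works on a config that has already been loaded into nested dicts (for instance with PyYAML's `yaml.safe_load`). The dotted keys are the ones named in this README. That config/infer.yaml nests them the same way is an assumption, and the helper names are hypothetical.

```python
import os

# Dotted keys named in this README; the nesting in config/infer.yaml is assumed to match
PATH_KEYS = [
    "vicuna.model_path",
    "vid2seq.clip_path",
    "vid2seq.output_path",
    "vid2seq.work_dir",
    "vid2seq.checkpoint_path",
]


def lookup(config, dotted_key):
    """Follow a dotted key such as 'vicuna.model_path' through nested dicts; None if absent."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def non_absolute_paths(config, keys=PATH_KEYS):
    """Return the keys whose value is present but is not an absolute path."""
    bad = []
    for key in keys:
        value = lookup(config, key)
        if value is not None and not os.path.isabs(str(value)):
            bad.append(key)
    return bad
```

Keys that are missing from the config are skipped, so the Vid2Seq entries do not raise false alarms when Vid2Seq is not used.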

Acknowledgment

This work is based on Vicuna, BLIP-2, GRiT, Vid2Seq, and Whisper. Thanks for their great work!
