🐼 Panda-70M: Video Captioning

[Note] To run the captioning code, make sure you follow this guideline and correctly prepare the vicuna-7b-v0 weights: first download the original weights, then apply the delta weights. Improperly prepared weights will lead to meaningless outputs.

Introduction

We propose a video captioning model that generates a caption for a short video clip. The model has a vision branch (green) and a textual branch (blue), so captioning can draw on both video and text inputs. We release the checkpoint trained on Panda-70M.

Preparations

Setup Repository and Environment

git clone https://github.com/snap-research/Panda-70M.git
cd Panda-70M/captioning

# create a conda environment
conda create --name panda70m_captioning python=3.9 -y
conda activate panda70m_captioning
pip install -r requirements.txt

# install default JRE
apt update
apt install default-jre

Download Checkpoint

You can manually download the file here (3.82GB) and move it to the checkpoint folder, or run:

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5" -O checkpoint/checkpoint_best.pth && rm -rf /tmp/cookies.txt
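A cookie-based download like the one above can save a small HTML page instead of the real file when it fails. The following is a minimal sketch of a size check, not part of the repository: the function name and the 10% margin are assumptions, and only the 3.82GB figure and the checkpoint/checkpoint_best.pth path come from this README.

```python
from pathlib import Path

# Size quoted in this README for the checkpoint (3.82GB).
EXPECTED_BYTES = 3.82e9


def checkpoint_looks_complete(path="checkpoint/checkpoint_best.pth",
                              min_bytes=0.9 * EXPECTED_BYTES):
    """Return True if the file exists and is at least `min_bytes` large.

    The 10% margin is an arbitrary assumption; it only rules out files
    that are far too small to be the checkpoint.
    """
    p = Path(path)
    if not p.is_file():
        return False
    return p.stat().st_size >= min_bytes
```

If the check returns False, downloading the file manually and moving it to the checkpoint folder is the fallback described above.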

Prepare Vicuna:

  • Please follow the instructions from FastChat to install the vicuna-7b-v0 weights.
  • [Note] You need to apply the delta weights. After processing, move the weights to the vicuna_weights/vicuna-7b-v0 folder; the file list should look like this.
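To compare your prepared weights against the reference file list, it can help to print what is actually in the folder. This is a small sketch under assumptions: the helper name is hypothetical, and it only lists file names. It does not verify that the delta weights were applied correctly.

```python
from pathlib import Path


def list_weight_files(folder="vicuna_weights/vicuna-7b-v0"):
    """Return the sorted file names found directly inside `folder`.

    Raises FileNotFoundError if the folder is missing and ValueError if
    it contains no files, since both mean the weights are not in place.
    """
    p = Path(folder)
    if not p.is_dir():
        raise FileNotFoundError(f"weights folder not found: {p}")
    names = sorted(f.name for f in p.iterdir() if f.is_file())
    if not names:
        raise ValueError(f"weights folder is empty: {p}")
    return names
```

Calling `print("\n".join(list_weight_files()))` from the captioning directory gives a listing you can check by eye against the expected one.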

Quick Demo

python inference.py --video-list inputs/video_list.txt --prompt-list inputs/prompt_list.txt

The code captions the two test videos listed in video_list.txt, using the textual information in prompt_list.txt as extra input. Here are some output examples:

Example 1
Input Text:
  Some information about a video you will get:
  Transcription: Today we're gonna take a quick look at the 1966 Ford Mustang GT 289 v8 under the hood.
  Metadata: ['Old VS New - 1966 Ford Mustang GT & 2018 Ford Mustang | Just a Quick Look', 'Lets check out this beautiful 1966 Ford Mustang GT 289 in the showroom with the 2018 Ford Mustang!']
  Please look at the video and faithfully summarize it in one sentence.
Output Caption:
  A red mustang parked in a showroom with american flags hanging from the ceiling.

Example 2
Input Text:
  Please faithfully summarize the following video in one sentence.
Output Caption:
  An aerial view of a city with a river running through it.

We will remove the video samples from our dataset / GitHub / project webpage / technical presentation upon request. To make a request, please contact tsaishienchen at gmail dot com.

  • [Note] You might get different outputs due to the randomness of the LLM's generation.
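The input texts in the examples above follow a simple template: an optional block with the transcription and metadata, followed by the summarization instruction. The sketch below assembles such a text. The function name is hypothetical, and joining the parts with newlines is an assumption; how prompts are laid out inside prompt_list.txt is not specified here, so check inputs/prompt_list.txt before relying on it.

```python
def build_prompt(transcription=None, metadata=None):
    """Assemble an input text in the layout of the examples above.

    With no extra information, only the plain instruction is returned.
    Newline separators between the parts are an assumption.
    """
    if transcription is None and metadata is None:
        return "Please faithfully summarize the following video in one sentence."
    lines = ["Some information about a video you will get:"]
    if transcription is not None:
        lines.append(f"Transcription: {transcription}")
    if metadata is not None:
        # Metadata is shown in the example as a Python-style list of strings.
        lines.append(f"Metadata: {list(metadata)}")
    lines.append("Please look at the video and faithfully summarize it in one sentence.")
    return "\n".join(lines)
```

For instance, `build_prompt("Hello.", ["Title", "Description"])` yields a four-line text with the same headings as the first example.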

Evaluation

Zero-shot Captioning Performance

Dataset   BLEU-4   ROUGE-L   METEOR   CIDEr   BertScore
MSRVTT    25.4%    50.1%     27.7%    31.5%   87.9%
MSVD      32.8%    61.2%     35.3%    49.2%   90.2%
  • [Note] The results might not be perfectly reproducible due to the randomness of the LLM's generation and could deviate by ±0.5%.
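When comparing your own run against the table, a small helper can flag metrics that fall outside the stated ±0.5% band. This is only a sketch: the function name is hypothetical, the scores are copied from the table above, and you are assumed to enter your measured scores by hand, since the output format of compute_results.py is not described here.

```python
# Reported zero-shot scores from the table above, in percent.
REPORTED = {
    "MSRVTT": {"BLEU-4": 25.4, "ROUGE-L": 50.1, "METEOR": 27.7,
               "CIDEr": 31.5, "BertScore": 87.9},
    "MSVD": {"BLEU-4": 32.8, "ROUGE-L": 61.2, "METEOR": 35.3,
             "CIDEr": 49.2, "BertScore": 90.2},
}


def out_of_tolerance(dataset, measured, tol=0.5):
    """Return {metric: (reported, measured)} for scores deviating by more than `tol`.

    `measured` maps metric names to percentages; metrics missing from it
    are skipped rather than treated as failures.
    """
    reported = REPORTED[dataset]
    return {
        metric: (reported[metric], value)
        for metric, value in measured.items()
        if abs(value - reported[metric]) > tol
    }
```

An empty result means every score you supplied is within the expected deviation.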

Prepare Testing Data

  • You can download the video samples here [MSRVTT / MSVD] and move them to the test_datasets/video_samples/MSRVTT or test_datasets/video_samples/MSVD folder, respectively.
  • The caption annotations of the testing samples are already saved in the test_datasets/anno_downstream folder.

Evaluation

# MSRVTT
python inference.py --video-list test_datasets/video_list/msrvtt_test.txt --output-json msrvtt_caption.json
python compute_results.py --predict-json msrvtt_caption.json --target-json test_datasets/anno_downstream/msrvtt_caption_test.json

# MSVD
python inference.py --video-list test_datasets/video_list/msvd_test.txt --output-json msvd_caption.json
python compute_results.py --predict-json msvd_caption.json --target-json test_datasets/anno_downstream/msvd_caption_test.json
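The two command pairs above differ only in the dataset name, so they can be generated and run in a loop. The sketch below is a convenience wrapper, not part of the repository: the function names are hypothetical, while the scripts, flags, and paths are the ones used in the commands above. It is meant to be run from the captioning directory.

```python
import subprocess
import sys


def evaluation_commands(dataset):
    """Return the inference and scoring commands for one dataset."""
    name = dataset.lower()
    prediction = f"{name}_caption.json"
    return [
        [sys.executable, "inference.py",
         "--video-list", f"test_datasets/video_list/{name}_test.txt",
         "--output-json", prediction],
        [sys.executable, "compute_results.py",
         "--predict-json", prediction,
         "--target-json", f"test_datasets/anno_downstream/{name}_caption_test.json"],
    ]


def run_evaluation(datasets=("MSRVTT", "MSVD")):
    """Run inference and then scoring for each dataset, stopping on the first error."""
    for dataset in datasets:
        for command in evaluation_commands(dataset):
            subprocess.run(command, check=True)
```

Using `sys.executable` keeps both scripts on the interpreter of the active conda environment; calling `run_evaluation()` then reproduces the four commands in order.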

Acknowledgements

The code for video captioning is built upon Video-LLaMA. Thanks for sharing the great work!