
Composed Query-Based Event Retrieval in Video Corpus with Multimodal Episodic Perceptron

PyTorch implementation for the paper "Composed Query-Based Event Retrieval in Video Corpus with Multimodal Episodic Perceptron"

TVR-CQ Video Features

We propose a novel event retrieval framework termed Composed Query-Based Event Retrieval (CQBER), which simulates the multi-modal perception ability of humans to improve accuracy in the retrieval process. Specifically, we first construct two CQBER benchmark datasets, namely ActivityNet-CQ and TVR-CQ, which cover open-world scenarios and TV shows, respectively. Additionally, we propose an initial CQBER method, termed Multimodal Episodic Perceptron (MEP), which excavates complete query semantics from both observed static visual cues and various descriptions. Extensive experiments demonstrate that our proposed framework significantly boosts event retrieval accuracy across different existing methods.

Dataset Overview

Figure 1: Supp. We compare our proposed TVR-CQ and ActivityNet-CQ datasets with the original TVR and ActivityNet-Captions datasets in detail.

Framework

Figure 2: An overview of the CQBER framework based on our proposed Multimodal Episodic Perceptron

Performance

Figure 3: Supp. Additional ablation studies regarding the key model components on the TVR-CQ dataset.

Visualization

Figure 4: Visualizations of event retrieval results using our MEP method on the TVR-CQ dataset

Figure 5: Visualizations of episodic perception in composed queries. Here we adopt the attention from the last VLCU layer.

Figure 6: More visualizations of episodic perception and event retrieval results using our MEP method on the TVR-CQ dataset.

The code is modified from ReLoCLNet.

Prerequisites

  • python 3.x with pytorch (1.7.0), torchvision, transformers, tensorboard, tqdm, h5py, easydict
  • cuda, cudnn

If you have Anaconda installed, the conda environment of ReLoCLNet can be built as follows (taking Python 3.7 as an example):

conda create --name CQBER python=3.7
conda activate CQBER
conda install -c anaconda cudatoolkit cudnn  
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=11.0 -c pytorch
conda install -c anaconda h5py=2.9.0
conda install -c conda-forge transformers tensorboard tqdm easydict

The conda environment of TVRetrieval also works.
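
After setting up either environment, an optional sanity check (not part of the original instructions) can confirm that PyTorch and the CUDA stack are visible:

# optional check: print the PyTorch version and whether CUDA is available
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"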

Getting started

  1. Clone this repository
  2. Download features

For the TVR dataset features, please download them from features and extract them to the features directory:

$ tar -xjf video_feats.tar.bz2 -C features 

This link may be useful if you want to download Google Drive files directly using wget.
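
As an alternative to the wget workaround, the archive can also be fetched with gdown; the snippet below is a sketch under the assumption that you know the Google Drive file ID of the archive (`<FILE_ID>` is a placeholder, not provided by this repository):

# assumed alternative workflow: download the feature archive with gdown
$ pip install gdown
$ gdown "https://drive.google.com/uc?id=<FILE_ID>" -O video_feats.tar.bz2
$ tar -xjf video_feats.tar.bz2 -C features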

  3. Add the project root to PYTHONPATH (note that you need to do this each time you start a new session):
$ source setup.sh
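
If you prefer not to use the script, the same effect can be achieved manually; this is a minimal sketch assuming setup.sh simply prepends the repository root to PYTHONPATH:

# run from the repository root; assumed equivalent of setup.sh
$ export PYTHONPATH="$(pwd):${PYTHONPATH}"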

Training and Inference

TVR dataset

# train; refer to `method_tvr/scripts/TVR_CQ_train.sh` and `method_tvr/config.py` for more details about hyper-parameters
$ bash method_tvr/scripts/TVR_CQ_train.sh tvr video_sub_tef resnet_i3d --exp_id CQBER
# inference
# the trained model directory is placed under method_tvr/results/tvr-video_sub_tef-CQBER-*
# set MODEL_DIR_NAME to tvr-video_sub_tef-CQBER-*
# SPLIT_NAME: [val | test]
$ bash method_tvr/scripts/inference.sh MODEL_DIR_NAME SPLIT_NAME
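
For example, assuming a training run produced a directory named tvr-video_sub_tef-CQBER-2021_01_01_12_00_00 (the timestamp suffix here is hypothetical), the validation split would be evaluated as:

$ bash method_tvr/scripts/inference.sh tvr-video_sub_tef-CQBER-2021_01_01_12_00_00 val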

TODO

  • Upload code for the ActivityNet Captions dataset