This is the PyTorch implementation of our paper "Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering" (accepted by IJCAI’22).
Ubuntu 14.04
Python 3.7
CUDA 10.1
cuDNN 7.5+
PyTorch >= 1.7.0
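A quick way to compare a local setup against the versions listed above is sketched below. The helper name `describe_environment` is hypothetical and not part of this repository. It only reads version strings that Python and PyTorch expose, and it leaves the PyTorch fields as `None` when `torch` is not installed.

```python
import platform


def describe_environment():
    """Collect the version strings relevant to the requirements listed above.

    Hypothetical helper for illustration; not part of this repository.
    """
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "cudnn": None,
    }
    try:
        import torch  # imported lazily so the check runs even without PyTorch
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda                # None on CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()   # None if cuDNN is unavailable
    return info


if __name__ == "__main__":
    for name, version in describe_environment().items():
        print(f"{name}: {version}")
```

The printed values can then be compared by eye with the list above. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the system-wide toolkit.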
- Download the dataset
MSVD-QA: link
MSRVTT-QA: link
TGIF-QA: link
- Preprocessing
- To extract GloVe embeddings for the questions and answers, please refer here.
Taking the action task in the TGIF-QA dataset as an example, we have features at the path /QAfeatures:
TGIF/word/action/TGIF_action_train_questions.pt
TGIF/word/action/TGIF_action_val_questions.pt
TGIF/word/action/TGIF_action_test_questions.pt
TGIF/word/action/TGIF_action_vocab.json
- To extract appearance and motion features, use the pretrained models here.
For the action task, we have features at the path /Vfeatures:
TGIF/SpatialFeatures/tumblr_nd24xaX8d11qkb1azo1_250/Features.pkl
(shape is 2^level-1,16,2048)
TGIF/SpatialFeatures/tumblr_no00ddSlG31t34v14o1_250/Features.pkl
...
TGIF/TemporalFeatures/tumblr_nd24xaX8d11qkb1azo1_250/Features.pkl
(shape is 2^level-1,2048)
TGIF/TemporalFeatures/tumblr_no00ddSlG31t34v14o1_250/Features.pkl
...
In our paper, the number of levels is set to 3 by default.
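Before training, it can help to confirm that the feature files are laid out as described above. The script below is a minimal sketch under stated assumptions: the function names, the `qa_root` and `v_root` arguments, and the `task` parameter are hypothetical and not part of this repository. Only the relative paths for the action task come from the listing above, and reusing the same naming pattern for other tasks is an assumption. The script never opens the files, so it assumes nothing about how the `.pt` or `.pkl` contents are stored.

```python
from pathlib import Path


def expected_question_files(qa_root, task="action"):
    """Question-side files for one TGIF-QA task, following the listing above."""
    base = Path(qa_root) / "TGIF" / "word" / task
    names = [f"TGIF_{task}_{split}_questions.pt" for split in ("train", "val", "test")]
    names.append(f"TGIF_{task}_vocab.json")
    return [base / name for name in names]


def expected_video_files(v_root, video_id):
    """SpatialFeatures and TemporalFeatures files for one video ID."""
    root = Path(v_root) / "TGIF"
    return [root / kind / video_id / "Features.pkl"
            for kind in ("SpatialFeatures", "TemporalFeatures")]


def missing_files(qa_root, v_root, video_ids, task="action"):
    """Return every expected path that is not an existing file on disk."""
    paths = expected_question_files(qa_root, task)
    for video_id in video_ids:
        paths.extend(expected_video_files(v_root, video_id))
    return [p for p in paths if not p.is_file()]


if __name__ == "__main__":
    # Placeholder roots; replace them with the actual feature locations.
    missing = missing_files("/QAfeatures", "/Vfeatures",
                            ["tumblr_nd24xaX8d11qkb1azo1_250"])
    for path in missing:
        print("missing:", path)
    print(f"{len(missing)} file(s) missing")
```

Checking only for existence keeps the sketch independent of the serialization format. A shape check against the values noted above could be added once the files are loaded with whatever loader the preprocessing scripts used.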
The trained models for the action task can be downloaded from here.
@article{peng2022MHN,
title={Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering},
author={Peng, Min and Wang, Chongyang and Gao, Yuan and Shi, Yu and Zhou, Xiang-Dong},
journal={Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI)},
year={2022}}