This repository contains most of the code needed to reproduce the experimental results of the paper Structured Attentions for Visual Question Answering on the VQA-1.0 and VQA-2.0 datasets. Currently only the accelerated version of mean field is provided, which is the variant used in the VQA 2.0 challenge.
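For orientation only, below is a minimal, unaccelerated mean-field sketch for grid-structured attention. The array shapes, the 4-connected neighbourhood, and the `pairwise_w` compatibility matrix are illustrative assumptions, not the accelerated implementation shipped in this repository.

```python
import numpy as np

def mean_field_attention(unary, pairwise_w, n_iters=3):
    """Plain (non-accelerated) mean-field sketch for grid-structured attention.

    unary:      (H, W, K) unary potentials for K attention labels per grid region
    pairwise_w: (K, K) label-compatibility weights shared by neighbouring regions
    """
    # initialise the marginals with a softmax over the unary potentials
    q = np.exp(unary - unary.max(axis=-1, keepdims=True))
    q /= q.sum(axis=-1, keepdims=True)
    for _ in range(n_iters):
        # aggregate the current marginals of the 4-connected neighbours
        msg = np.zeros_like(q)
        msg[1:, :] += q[:-1, :]    # from the region above
        msg[:-1, :] += q[1:, :]    # from the region below
        msg[:, 1:] += q[:, :-1]    # from the region to the left
        msg[:, :-1] += q[:, 1:]    # from the region to the right
        # combine unary potentials with pairwise messages, then renormalise
        logits = unary + msg @ pairwise_w
        q = np.exp(logits - logits.max(axis=-1, keepdims=True))
        q /= q.sum(axis=-1, keepdims=True)
    return q
```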
To reproduce the experimental results:
- Clone and compile mxnet, pinned at mxnet@c9e252, cub@89de7ab, dmlc-core@3dfbc6, nnvm@d3558d, ps-lite@acdb69, mshadow@8eb1e0. Later versions of mxnet modified the optimizers (and other components), and the code in this repository has not been adapted to them yet.
- Extract ResNet-152 features of the MS COCO images with MCB's preprocessing code (a generic extraction sketch follows this list).
- Download our training question and answer data for VQA-2.0 from Baidu Pan.
- Set the arguments and run train_VQA.py.
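The features used in the paper come from MCB's preprocessing code; the snippet below is only a rough sketch of how ResNet-152 feature maps could be pulled out with MXNet. The checkpoint prefix `resnet-152`, the internal layer name `relu1_output`, and the 448x448 input size are assumptions to adapt to your own checkpoint and setup.

```python
import mxnet as mx
import numpy as np

# Assumed checkpoint files: resnet-152-symbol.json / resnet-152-0000.params,
# e.g. an ImageNet-pretrained ResNet-152 from the MXNet model zoo.
sym, arg_params, aux_params = mx.model.load_checkpoint("resnet-152", 0)

# Cut the network at the last convolutional feature map. "relu1_output" is the
# name used by the classic model-zoo ResNets; adjust it for other checkpoints.
feat_sym = sym.get_internals()["relu1_output"]

mod = mx.mod.Module(symbol=feat_sym, data_names=["data"], label_names=None,
                    context=mx.gpu(0))
mod.bind(for_training=False, data_shapes=[("data", (1, 3, 448, 448))])
# Depending on the mxnet version, allow_extra=True may also be needed, since the
# classifier weights in arg_params are unused after cutting the symbol.
mod.set_params(arg_params, aux_params, allow_missing=True)

def extract_feature(img_chw):
    """img_chw: float32 array of shape (3, 448, 448), already mean-subtracted."""
    batch = mx.io.DataBatch(data=[mx.nd.array(img_chw[np.newaxis])])
    mod.forward(batch, is_train=False)
    return mod.get_outputs()[0].asnumpy()   # roughly (1, 2048, 14, 14)
```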
The best single-model accuracies on test-dev of VQA-1.0 and VQA-2.0, with skip-thought vector initialization and Visual Genome training data, are 67.19 and 64.78 respectively. Here is the trained model on VQA-2.0.
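After downloading the released model, a quick way to inspect it before wiring it into evaluation is to load the checkpoint directly. The prefix `vqa-sva-vqa2` and epoch number below are hypothetical placeholders for the downloaded file names.

```python
import mxnet as mx

# Hypothetical prefix/epoch; substitute the file names from the download link
# (e.g. vqa-sva-vqa2-symbol.json / vqa-sva-vqa2-0000.params).
prefix, epoch = "vqa-sva-vqa2", 0
sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)

# List the outputs and parameter shapes before plugging the model into evaluation.
print(sym.list_outputs())
for name, arr in sorted(arg_params.items()):
    print(name, arr.shape)
```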
If you find this repository helpful, please cite:
@inproceedings{chen2017sva,
  title={Structured Attentions for Visual Question Answering},
  author={Zhu, Chen and Zhao, Yanpeng and Huang, Shuaiyi and Tu, Kewei and Ma, Yi},
  booktitle={IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
}
This code is distributed under the MIT license.