This repository contains code and experiments for MultiViz, a general-purpose framework for visualizing and interpreting multimodal models. Check out the MultiViz visualization website.
Correspondence to:
- Paul Pu Liang ([email protected])
- Yiwei Lyu ([email protected])
- Gunjan Chhablani ([email protected])
- Nihal Jain ([email protected])
- Zihao Deng ([email protected])
- Xingbo Wang ([email protected])
- Louis-Philippe Morency ([email protected])
- Ruslan Salakhutdinov ([email protected])
MultiViz: Towards Visualizing and Understanding Multimodal Models
Paul Pu Liang, Yiwei Lyu, Gunjan Chhablani, Nihal Jain, Zihao Deng, Xingbo Wang, Louis-Philippe Morency, Ruslan Salakhutdinov
ICLR 2023.
If you find this repository useful, please cite our paper:
@inproceedings{liang2023multiviz,
title={MultiViz: Towards Visualizing and Understanding Multimodal Models},
author={Liang, Paul Pu and Lyu, Yiwei and Chhablani, Gunjan and Jain, Nihal and Deng, Zihao and Wang, Xingbo and Morency, Louis-Philippe and Salakhutdinov, Ruslan},
booktitle={International Conference on Learning Representations},
year={2023}
}
We designed a structured framework that allows easy analysis and visualization of multimodal models. To run anything within the structured framework, you should have structured-framework/ as your working directory.
In the structured framework, there are 4 main modules: datasets, models, analysis and visualizations.
Each script in the datasets module loads data points from a specific dataset and also exposes all necessary information that comes directly from the dataset (such as label-id-to-answer mappings). To add a new dataset to the framework, simply add another file under this module following the same format as the existing ones.
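For illustration, a new dataset file might look roughly like the sketch below. All class, method, and path names are hypothetical and only mirror the general shape of the existing dataset files (loading raw data points and exposing a label-id-to-answer mapping); follow the actual files in the module for the exact interface.

```python
# datasets/mynewdataset.py -- hypothetical skeleton of a new dataset file.
# All names here are placeholders; mirror the existing dataset files in this
# module for the exact interface expected by the analysis scripts.
import json

class MyNewDataset:
    def __init__(self, data_path="data/mynewdataset/val.json"):
        with open(data_path) as f:
            self.datapoints = json.load(f)        # raw data points
        self.idx2answer = {0: "no", 1: "yes"}     # label-id-to-answer mapping

    def getdata(self, idx):
        """Return one raw data point (e.g. image path, question, label id)."""
        return self.datapoints[idx]

    def length(self):
        return len(self.datapoints)

    def classnames(self):
        """Expose the label-id-to-answer mapping for the visualizations."""
        return [self.idx2answer[i] for i in sorted(self.idx2answer)]
```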
Each script in the models module contains a wrapper class for a specific multimodal model on a particular dataset. All such classes should be subclasses of the AnalysisModel class defined in models/analysismodel.py, which specifies certain functionalities the class must implement, such as forward, getpredlabel, etc. To add a new model to the framework, add another file under this module and write a wrapper class for your model that extends the AnalysisModel class.
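For illustration, a new model wrapper might look roughly like the following sketch. Only forward and getpredlabel come from the description above; the class name, checkpoint handling, and preprocess helper are placeholders, and the full set of required methods is specified in models/analysismodel.py.

```python
# models/mynewmodel.py -- hypothetical wrapper extending the AnalysisModel class.
# Check models/analysismodel.py for the exact class name and the full list of
# methods that must be implemented.
import torch
from models.analysismodel import AnalysisModel  # exact name/casing as defined in the file

class MyNewModelWrapper(AnalysisModel):
    def __init__(self, checkpoint_path, device=None):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        # Assumes the checkpoint is a full serialized model; adapt if it is a state_dict.
        self.model = torch.load(checkpoint_path, map_location=self.device)
        self.model.eval()

    def forward(self, datainstance):
        """Run the wrapped multimodal model on one data point."""
        with torch.no_grad():
            return self.model(*self.preprocess(datainstance))

    def getpredlabel(self, resultobj):
        """Map the raw model output to a predicted label id."""
        return resultobj.argmax(dim=-1).item()

    def preprocess(self, datainstance):
        """Placeholder: convert a raw data point into model-ready tensors."""
        raise NotImplementedError
```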
The analysis module contains scripts that run various analysis methods on arbitrary dataset and model combinations. This is enabled by calling the common functionalities specified in the AnalysisModel class. These scripts output raw analysis results (usually just a bunch of numbers), and the scripts in the visualizations module are tools to create visualizations of these raw results. To add additional analysis and visualization methods, simply add functions to these modules.
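As a rough illustration of this split, an analysis function returns raw numbers while a companion visualization function renders them; the function names below are hypothetical, and the model is only touched through the generic interface (forward, getpredlabel), whose exact signatures in the repository may differ.

```python
# Hypothetical analysis/visualization pair. The model is accessed only through
# the generic AnalysisModel interface; all helper names are illustrative.
import matplotlib.pyplot as plt

def toy_label_scores(analysismodel, datainstance):
    """Analysis step: return raw numbers (output scores and the predicted label id)."""
    outputs = analysismodel.forward(datainstance)
    pred = analysismodel.getpredlabel(outputs)
    return outputs, pred

def plot_label_scores(outputs, pred, savepath="scores.png"):
    """Visualization step: turn the raw scores into a simple bar chart."""
    scores = list(outputs)  # assumes forward returns a flat sequence of per-label scores
    plt.bar(range(len(scores)), scores)
    plt.title(f"predicted label id: {pred}")
    plt.savefig(savepath)
```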
To download the VQA dataset, fetch the following files from these urls and unzip them all into a folder named structured-framework/data:
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip
wget http://images.cocodataset.org/zips/val2014.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip
Here are a few example scripts for running analysis methods on VQA with the LXMERT model:
(1) Unimodal LIME: structured-framework/examples/vqa_lxmert_unimodallime.py
(2) EMAP: structured-framework/examples/vqa_lxmert_emap.py
(3) DIME: structured-framework/examples/vqa_lxmert_dime.py
(4) Sparse Linear Model together with local and global representation feature analysis: structured-framework/examples/vqa_lxmert_sparselinearmodel.py
(5) Global representation feature analysis on all features: structured-framework/examples/vqa_lxmert_featureonly.py
(6) No analysis, just showing the original question and image and correct/predicted answer in one image: structured-framework/examples/vqa_lxmert_show.py
Note that the version of LXMERT used is directly from the HuggingFace Transformers pipeline, which contains a bug in image preprocessing that flips the red and blue values of pixels. To use the bug-free version, simply comment out line 571 of structured-framework/models/lxmert_extras/utils.py
To download the CLEVR dataset, run the following commands and place the unzipped data in a folder named structured-framework/data:
wget https://dl.fbaipublicfiles.com/clevr/CLEVR_v1.0.zip
unzip CLEVR_v1.0.zip
We use both the MDETR and CNNLSTMSA models on the CLEVR dataset. These are pretrained models from the MDETR github repository and the clevr-iep github repository, respectively. The MDETR model achieves a nearly perfect score on CLEVR, whereas CNNLSTMSA is a baseline proposed in the original CLEVR paper.
Below are a few example scripts for running analysis methods on CLEVR with the MDETR model. To use the CNNLSTMSA model, you should create a new conda environment with Python 3.5 and install all the requirements from the clevr-iep github repository.
(1) First Order Gradient: structured-framework/examples/clevr-mdetr-gradient.py
(2) Sparse Linear Model together with local and global representation feature analysis: structured-framework/examples/clevr-mdetr-slm.py
This repository contains a processed version of the CMU-MOSEI validation split in structured-framework/data/MOSEI/mosei_valid_processed_features_list.pkl (if not in the repo, it can be downloaded here).
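As a minimal sketch (assuming structured-framework/ is the working directory; the internal structure of the pickle, e.g. a list of per-datapoint feature tuples, is an assumption worth verifying), the processed split can be loaded with the standard pickle module:

```python
# Minimal sketch: load the processed CMU-MOSEI validation split.
# The contents are assumed to be a list-like collection of per-datapoint features.
import pickle

with open("data/MOSEI/mosei_valid_processed_features_list.pkl", "rb") as f:
    valid_features = pickle.load(f)

print(type(valid_features), len(valid_features))  # quick sanity check
```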
If you would like to process the data yourself or visualize the analysis results in the form of videos using our provided methods, you need to download the original data from the link posted on the CMU-MultimodalSDK github: http://immortal.multicomp.cs.cmu.edu/raw_datasets/. Put the downloaded zip file into structured-framework/data/.
This repository also contains a pretrained Multimodal Transformer (MulT) for CMU-MOSEI in structured-framework/models/mult/mosei_sentimult_MULT_a_0.3_e_0.2_o_0.3_res_0.1.pt. For more information about the MulT model and its training, refer to the original github repository.
Here are a few example scripts for running analysis methods on CMU-MOSEI with the MulT model:
(1) First Order Gradient: structured-framework/examples/mosei_mult_gradient.py
(2) Sparse Linear Model together with local and global representation feature analysis: structured-framework/examples/mosei_mult_sparselinearmodel.py
(3) Second Order Gradient: structured-framework/examples/mosei_mult_secondordergradient.py
(4) Generate all analysis results and the json file: structured-framework/examples/mosei_mult_runandmakejson.py
The raw MM-IMDb dataset can be downloaded from here. The downloaded .hdf5 file can be used directly in the example scripts below.
The analyses presented in the paper use an LRTF model that predicts the genres of movies from their posters and synopses. To train this model on the MM-IMDb data, follow the instructions here.
Here are a few example scripts for running the analyses presented in the paper with the LRTF model on the MM-IMDb dataset:
(1) First & Second Order Gradients: structured-framework/examples/imdb_lrtf_gradsaliency.py
(2) Sparse Linear Model training and visualization: structured-framework/examples/imdb_lrtf_sparselinearmodel.py
(3) Unimodal LIME: structured-framework/examples/imdb_lrtf_unimodallime.py
To download the Flickr30k dataset, follow the instructions shared on the official page provided in the Downloads section. If you are interested in evaluating the cross-modal interpretations against ground-truth annotations, the Flickr30k Entities annotations can be found on this GitHub repository.
Then unzip the data into a folder named structured-framework/data. It should have the following structure:
│   ├── data
│   │   └── flickr30k
│   │       ├── Annotations
│   │       ├── flickr30k-images
│   │       └── Sentences
The Annotations and Sentences folders come from the Flickr30k Entities dataset.
We use both the CLIP and ViLT models on the Flickr30k dataset, both of which are available in the HuggingFace model ecosystem. The ViLT model is fine-tuned on the Flickr30k dataset.
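As a rough illustration of how such HuggingFace checkpoints are typically loaded (the exact checkpoint identifiers used by the wrappers in this repository may differ):

```python
# Rough illustration of loading CLIP and ViLT from the HuggingFace hub.
# The checkpoint identifiers below are common public ones and are assumptions;
# they may not be the exact checkpoints used by this repository's wrappers.
from transformers import CLIPModel, CLIPProcessor, ViltModel, ViltProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

vilt_model = ViltModel.from_pretrained("dandelin/vilt-b32-finetuned-flickr30k")
vilt_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-flickr30k")
```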
Below are a few example scripts for running analysis methods on Flickr30k with the ViLT model.
(1) First Order Gradient: structured-framework/examples/flickr30k_vilt_gradsaliency.py
(2) Second Order Gradient: structured-framework/examples/flickr30k_vilt_gradient.py
Examples for the CLIP model are provided as well.
To download the MIMIC dataset, follow the instructions at https://github.com/pliang279/MultiBench to obtain the im.pk file. When loading the dataset in structured-framework/datasets/mimic.py, set the path to the location of im.pk.
To use the LF model, clone the MultiBench repository from the same link above and then follow the instructions there to train an LF model for MIMIC. Then pass both the path to the saved checkpoint and the path to the cloned repository into the constructor of MIMICLF. See the examples below for details.
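A minimal sketch of this wiring, with an assumed import path and placeholder paths, might look like:

```python
# Hypothetical sketch of constructing the MIMIC LF wrapper described above.
# The import path is an assumption; see the MIMIC example scripts below for
# the actual module location and usage.
from models.mimiclf import MIMICLF

checkpoint_path = "path/to/trained_lf_mimic_checkpoint.pt"  # LF model trained via MultiBench
multibench_path = "path/to/MultiBench"                      # local clone of the MultiBench repo

model = MIMICLF(checkpoint_path, multibench_path)
```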
Here are a few example scripts for running analysis methods on MIMIC with the LF model:
(1) Unimodal LIME: structured-framework/examples/mimic_lf_unimodallime.py
(2) EMAP: structured-framework/examples/mimic_lf_emap.py
(3) First Order Gradient: structured-framework/examples/mimic_lf_firstordergradient.py
MultiViz is designed so that it can easily be extended to other datasets, models, and analysis methods.
If you wish to add your own dataset or model and use the existing analysis scripts in the repository, just follow the same format as the existing dataset and model classes. All existing analysis scripts assume that the model being analyzed implements the functions in models/analysismodel.py, so if your model class implements these functions, you can directly apply the existing analysis scripts to your dataset and model.
If you wish to add your own analysis method and make it applicable to multiple existing datasets and models, simply make your analysis function take an analysismodel object as input and use only the functions specified in models/analysismodel.py to interact with the model; the resulting script will then be applicable to all datasets and models currently in this repository.