This is a paper list for the topic of multimodal dialogue systems.
Keywords: multimodal, dialogue system, visual, conversation
(1) Visual Question Answering (VQA): datasets (CVPR 2019, 2020, 2021, ...) containing open-ended questions about images; answering them requires an understanding of vision, language, and commonsense knowledge.
- VQA datasets v1.0 and v2.0
- TextVQA: requires models to read and reason about text in an image to answer questions about it. To perform well, a model must first detect and read the text in the image, then reason over it to answer the question.
- TextCaps: requires models to read and reason about text in images to generate captions. A model must incorporate the text modality present in the image and reason over it together with the visual content to generate the description.
- Issues :
- visual-explainable: the model should rely on the right visual regions when making decisions
- question-sensitive: the model should be sensitive to linguistic variations in questions
- reduce language biases: the model should not take a language shortcut and answer the question without looking at the image
- Further Papers (too many)
- cross-modal interaction / fusion (a cross-attention sketch follows this sub-list)
- Bottom-up and top-down attention for image captioning and visual question answering in CVPR2018, winner of the 2017 Visual Question Answering challenge
- Multimodal Neural Graph Memory Networks for Visual Question Answering ACL2020, visual features + encoded region-grounded captions (of object attributes and their relationships) feed two graph networks that compute a question-guided contextualized representation for each; the updated representations are then written to an external spatial memory.
- Cross-Modality Relevance for Reasoning on Language and Vision in ACL2020
- Hypergraph Attention Networks for Multimodal Learning CVPR2020
- Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions? EMNLP2016
- Multi-level Attention Networks for Visual Question Answering CVPR2017
- Hierarchical Question-Image Co-Attention for Visual Question Answering NeurIPS2016
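Most of the fusion papers above instantiate some variant of one primitive: text features attend over region features (or vice versa). A minimal, hedged sketch of that primitive in PyTorch; names and dimensions are illustrative, not taken from any specific paper:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text tokens attend over image-region features (illustrative sketch)."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, region_feats):
        # text_feats:   (batch, num_tokens, dim),  e.g. LSTM/BERT outputs
        # region_feats: (batch, num_regions, dim), e.g. projected detector features
        attended, _ = self.cross_attn(query=text_feats,
                                      key=region_feats, value=region_feats)
        return self.norm(text_feats + attended)   # residual + layer norm

fusion = CrossModalFusion()
text = torch.randn(2, 20, 768)      # 20 question tokens
regions = torch.randn(2, 36, 768)   # 36 detected regions
print(fusion(text, regions).shape)  # torch.Size([2, 20, 768])
```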
- vision-language pretraining / representation learning (a single-stream input sketch follows this sub-list)
- VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv2019, grounds elements of language to image regions with self-attention
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks NeurIPS2019
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations [Code] ICLR2020
- VinVL: Making Visual Representations Matter in Vision-Language Models [Code] CVPR2021
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision [Code] ICML 2021
- 12-in-1: Multi-Task Vision and Language Representation Learning [Code] CVPR2020
- Unified Vision-Language Pre-Training for Image Captioning and VQA [Code] AAAI2020
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers [Code] EMNLP2019
- Adaptive Transformers for Learning Multimodal Representations [Code] SRW ACL2020
- Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer [Data Code] ACL2020
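The single-stream models above (VisualBERT, VL-BERT, Unified VLP) largely share one input recipe: project region features into the word-embedding space, tag each token with a modality segment embedding, and run one Transformer over the concatenation. A hedged sketch of that recipe, with illustrative dimensions; position embeddings and the pretraining losses are omitted:

```python
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    """Joint text+region sequence through one Transformer (VisualBERT-style
    sketch). Position embeddings and pretraining objectives (masked LM,
    image-text matching) are omitted for brevity."""
    def __init__(self, vocab_size=30522, region_dim=2048, dim=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.region_proj = nn.Linear(region_dim, dim)  # visual -> embedding space
        self.segment_emb = nn.Embedding(2, dim)        # 0 = text, 1 = vision
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, token_ids, region_feats):
        # token_ids: (batch, num_tokens) LongTensor
        # region_feats: (batch, num_regions, region_dim) from a detector
        t = self.word_emb(token_ids) + self.segment_emb(torch.zeros_like(token_ids))
        seg_v = torch.ones(region_feats.shape[:2], dtype=torch.long,
                           device=region_feats.device)
        v = self.region_proj(region_feats) + self.segment_emb(seg_v)
        return self.encoder(torch.cat([t, v], dim=1))  # one joint sequence
```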
- Language prior issue (a bias-masking sketch follows this sub-list)
- AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss, treats VQA from a feature-space-learning perspective rather than as a classification task
- Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering CVPR2017, introduces VQA 2.0, which balances the dataset to weaken the language prior relative to images
- Self-Critical Reasoning for Robust Visual Question Answering NeurIPS2019
- Overcoming Language Priors in Visual Question Answering with Adversarial Regularization NeurIPS2018, uses a question-only adversary model
- RUBi: Reducing Unimodal Biases in Visual Question Answering NeurIPS2019, also builds on a question-only model
- Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering [Code] CVPR2018
- Counterfactual VQA: A Cause-Effect Look at Language Bias [Code] CVPR2021
- Counterfactual Vision and Language Learning CVPR2020
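The question-only idea behind the adversarial-regularization paper and RUBi above, sketched: a question-only branch captures the language prior, and its output masks the main model's logits during training, so shortcuts that ignore the image stop paying off. This is a hedged sketch, not the papers' exact formulation:

```python
import torch
import torch.nn.functional as F

def rubi_style_losses(fused_logits, question_only_logits, answers):
    """RUBi-flavored debiasing (sketch). At test time, use fused_logits alone."""
    # The question-only branch captures the language prior; masking the fused
    # logits with it means image-agnostic shortcuts no longer reduce the loss.
    masked_logits = fused_logits * torch.sigmoid(question_only_logits)
    loss_main = F.cross_entropy(masked_logits, answers)
    # The question-only classifier is trained too; in the paper its gradient
    # is blocked from the shared question encoder (e.g. via .detach()).
    loss_q = F.cross_entropy(question_only_logits, answers)
    return loss_main + loss_q
```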
- Visual-explainable issue
- Counterfactual Samples Synthesizing for Robust Visual Question Answering CVPR2020
- Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering EMNLP2020
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision ECCV2020, leverages an overlooked supervisory signal in existing datasets to improve generalization
- Generating Natural Language Explanations for Visual Question Answering using Scene Graphs and Visual Attention arXiv2019
- Towards Transparent AI Systems: Interpreting Visual Question Answering Models 2016
- object relation reasoning / visual understanding / cross-modal / Graphs (a graph-attention sketch follows this sub-list)
- MUREL: Multimodal Relational Reasoning for Visual Question Answering CVPR2019, [Code], represents and refines interactions between question words and image regions, finer-grained than attention maps
- CRA-Net: Composed Relation Attention Network for Visual Question Answering ACM2019, object-relation reasoning attention should look at both visual (appearance, spatial) and linguistic (question) features (not allowed to read the paper?)
- Hierarchical Graph Attention Network for Visual Relationship Detection CVPR2020 object-level graph: (1) woman (sit on) bench, (2) woman (in front of) water; triplet-level graph: relation between triplet(1) and triplet(2)
- Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations IEEE2021, relational visual-linguistic BERT
- Relation-Aware Graph Attention Network for Visual Question Answering ICCV2019, explicit relations of geometric positions and semantic interactions between objects, implicit relations of hidden dynamics between image regions
- Fusion of Detected Objects in Text for Visual Question Answering EMNLP2020
- GraphVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering arXiv2021
- A Simple Baseline for Visual Commonsense Reasoning ViGil@NeurIPS2019
- Learning Conditioned Graph Structures for Interpretable Visual Question Answering [Code] NeurIPS2018
- Graph-Structured Representations for Visual Question Answering CVPR2017
- R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering [Code] ACM KDD2018
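A common core behind these relation-reasoning papers: object nodes exchange messages with attention weights conditioned on the question. A minimal, generic message-passing step; this is an illustrative sketch, not any specific paper's layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedGraphAttention(nn.Module):
    """One message-passing step over fully connected object nodes, with
    attention conditioned on the question vector (generic sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.msg = nn.Linear(dim, dim)

    def forward(self, nodes, question):
        # nodes: (batch, num_objects, dim), question: (batch, dim)
        guided = nodes * self.q_proj(question).unsqueeze(1)   # question gating
        scores = guided @ self.k_proj(nodes).transpose(1, 2)  # (B, N, N) edges
        attn = F.softmax(scores / nodes.shape[-1] ** 0.5, dim=-1)
        return nodes + attn @ self.msg(nodes)                 # residual update
```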
- Knowledge / cross-modal fusion / Graphs
- Towards Knowledge-Augmented Visual Question Answering Coling2020, captures the interactions between objects in a visual scene and entities in an external knowledge source, with many, many graphs ...
- ConceptBert: Concept-Aware Representation for Visual Question Answering EMNLP2020, learns a joint Concept-Vision-Language embedding (maybe similar to [this paper] in how it adds an "entity embedding"?)
- Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks 2017
- text in the image (TextCap & TextVQA); a pointer-decoding sketch follows this sub-list
- Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [Code] CVPR2020, e.g. the printed text on a bottle is the brand of the drink ==> the graph representation of the image has sub-graphs with respective aggregators that pass messages among the graphs (not sure I'm describing this correctly???)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval arXiv2020, common semantic space between salient objects and text found in an image
- Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps arXiv2020, a simple attention mechanism is good enough
- Cascade Reasoning Network for Text-based Visual Question Answering ACM2020, 1) which information is useful, 2) questions relate to text but also to visual concepts, so how to capture cross-modal relationships, 3) what if OCR fails
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption arXiv2020, incorporates OCR generated text in pre-training
- Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA arXiv2020
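The pointer-augmented decoding used by M4C-style TextVQA models, sketched: each decoding step jointly scores a fixed answer vocabulary and the OCR tokens detected in the image, so the model can copy text it has read. A hedged sketch with illustrative names:

```python
import torch
import torch.nn as nn

class PointerAugmentedHead(nn.Module):
    """Score a fixed vocabulary and the image's OCR tokens jointly
    (M4C-flavored sketch; names are illustrative)."""
    def __init__(self, dim=768, vocab_size=5000):
        super().__init__()
        self.vocab_head = nn.Linear(dim, vocab_size)
        self.ocr_proj = nn.Linear(dim, dim)

    def forward(self, dec_state, ocr_feats):
        # dec_state: (batch, dim) decoder state for the current step
        # ocr_feats: (batch, num_ocr, dim) embeddings of detected OCR tokens
        vocab_scores = self.vocab_head(dec_state)                  # (B, vocab)
        copy_scores = (self.ocr_proj(ocr_feats)
                       @ dec_state.unsqueeze(-1)).squeeze(-1)      # (B, num_ocr)
        # argmax over the concatenation picks either a vocabulary word
        # or one of the OCR tokens read from this particular image
        return torch.cat([vocab_scores, copy_scores], dim=-1)
```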
- multi-task
(2) Visual Dialog (CVPR 2017): open-domain dialogs; given an image, a dialog history, and a follow-up question about the image, the task is to answer the question.
- VisDial v1.0 dataset [Paper] [Source Code to collect chat data]
- Further papers
- reasoning
- KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue ACM2020, here knowledge = text knowledge & vision knowledge; encoding (T2V and V2T graphs), then bridging (updating graph nodes), then storing, then retrieving (via an adaptive information selection mode)
- Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog ACL2019, iteratively refine the question's representation based on image and dialog history
- Recursive visual attention in visual dialog CVPR2019 [Code]
- DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog AAAI2020
- Visual Reasoning with Multi-hop Feature Modulation [Code] ECCV2018
- VisualCOMET: Reasoning About the Dynamic Context of a Still Image [Code] ECCV2020
- understanding
- DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue [Code] AAAI2020
- Learning Dual Encoding Model for Adaptive Visual Understanding in Visual Dialogue [Code] IEEE2021
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [Code] EMNLP2020
- coreference
- reference
- cross-modal / fusion / joint / dual ...
- Efficient Attention Mechanism for Handling All the Interactions between Many Inputs with Application to Visual Dialog ECCV2019
- Image-Question-Answer Synergistic Network for Visual Dialog CVPR2019
- DialGraph: Sparse Graph Learning Networks for Visual Dialog arXiv
- All-in-One Image-Grounded Conversational Agents arXiv2019
- Visual-Textual Alignment for Graph Inference in Visual Dialog Coling2020
- Connecting Language and Vision to Actions ACL2018
- Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries [Code] CVPR2018
- Neural Multimodal Belief Tracker with Adaptive Attention for Dialogue Systems WWW2019
- Reactive Multi-Stage Feature Fusion for Multimodal Dialogue Modeling 2019
- Two Causal Principles for Improving Visual Dialog [Code] CVPR2020
- Learning Cross-modal Context Graph for Visual Grounding [Code] AAAI2020
- Multi-View Attention Networks for Visual Dialog [Code] arXiv2020
- Efficient Attention Mechanism for Visual Dialog that Can Handle All the Interactions Between Multiple Inputs [Code] ECCV2020
- Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features [Code in TF] ECCV2018
- use dialog history / user guided
- knowledge
- The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents ACL2020
- Knowledge-aware Multimodal Dialogue Systems ACM2018
- A Knowledge-Grounded Multimodal Search-Based Conversational Agent [Code wow finally a code about "knowledge" or "graph"] SCAI@EMNLP2018
- modality bias
- Modality-Balanced Models for Visual Dialogue AAAI2020
- Training data-efficient image transformers & distillation through attention [Code] arXiv2020
- Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning EMNLP2020
- Visual Dialogue without Vision or Dialogue 2018
- Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision EMNLP findings 2020
- Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models 2021
- pretraining / representation learning / BERTology
- VD-BERT: A Unified Vision and Dialog Transformer with BERT [Code] EMNLP2020
- Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline [Code] ECCV2020
- Kaleido-BERT: Vision-Language Pre-training on Fashion Domain [Code] arXiv2021
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [Code] ECCV2020
- 12-in-1: Multi-Task Vision and Language Representation Learning [Code] CVPR2020
- Large-Scale Adversarial Training for Vision-and-Language Representation Learning [Code] NeurIPS 2020
- Integrating Multimodal Information in Large Pretrained Transformers [Code] ACL2020
- Generative dialogue / diverse
- Improving Generative Visual Dialog by Answering Diverse Questions EMNLP 2019, [Code]
- Visual Dialogue State Tracking for Question Generation [Code is in the GuessWhat/GuessWhich/VisDial series] AAAI2020
- MultiDM-GCN: Aspect-Guided Response Generation in Multi-Domain Multi-Modal Dialogue System using Graph Convolution Network EMNLP2020
- Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model [Code] NIPS2017
- FLIPDIAL: A Generative Model for Two-Way Visual Dialogue CVPR2018
- DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue IJCAI2020 [Code soon]
- More to diverse: Generating diversified responses in a task oriented multimodal dialog system 2020
- Multimodal Dialog System: Generating Responses via Adaptive Decoders ACM2019
- Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation IJCNLP2017
- Multimodal Differential Network for Visual Question Generation [Code] EMNLP2018
- Generative Visual Dialogue System via Adaptive Reasoning and Weighted Likelihood Estimation 2019
- Aspect-Aware Response Generation for Multimodal Dialogue System ACM 2021
- An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games EACL2021
- Adversarial training
- RL
- Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning ICCV2017 oral, [Code]
- Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog SIGDIAL 2018
- Multimodal Dialog for Browsing Large Visual Catalogs using Exploration-Exploitation Paradigm in a Joint Embedding Space ICMR2019
- Recurrent Attention Network with Reinforced Generator for Visual Dialog ACM 2020
- linguistic / probabilistic
- reasoning
(3) CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog NAACL2019, [code]
- Further paper
(4) Open-domain:
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
- The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue ACL2019
- A Visually-Grounded Parallel Corpus with Phrase-to-Region Linking LREC2020
- Papers
- Multi-Modal Open-Domain Dialogue 2020
- Open Domain Dialogue Generation with Latent Images 2020
- [Image-Chat: Engaging Grounded Conversations] ACL2020
- The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents ACL2020
(?) sentiment
- MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations Coling2020
- Bridging Dialogue Generation and Facial Expression Synthesis 2019
(5) Task/Goal-oriented:
- CRWIZ: A Framework for Crowdsourcing Real-Time Wizard-of-Oz Dialogues LREC2020
- A Corpus for Reasoning About Natural Language Grounded in Photographs ACL2019
- CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication [Data] ACL2019
- AirDialogue: An Environment for Goal-Oriented Dialogue Research EMNLP2018
- ReferIt [paper] EMNLP2014, a two-player game of referring and labeling
- Papers
- Answerer in Questioner's Mind for Goal-Oriented Visual Dialogue [Code] NeurIPS 2018
- End-to-end optimization of goal-driven and visually grounded dialogue systems IJCAI2017
- Learning Goal-Oriented Visual Dialog via Tempered Policy Gradient [Code] IEEE2018
- Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue [Code] ACM MM 2020
- Building Task-Oriented Visual Dialog Systems Through Alternative Optimization Between Dialog Policy and Language Generation EMNLP2019
- Storyboarding of Recipes: Grounded Contextual Generation [Script Data] DGS@ICLR2019
- Gold Seeker: Information Gain From Policy Distributions for Goal-Oriented Vision-and-Language Reasoning CVPR2020
(6) evaluation
- A Revised Generative Evaluation of Visual Dialogue [Code] arXiv2020
- Evaluating Visual Conversational Agents via Cooperative Human-AI Games [Code for GuessWhich] 2017
- The Interplay of Task Success and Dialogue Quality: An in-depth Evaluation in Task-Oriented Visual Dialogues EACL2021
(7) classification
- GuessWhat?! Visual Object Discovery Through Multi-Modal Dialogue in CVPR2017, a two-player guessing game (1 oracle & 1 questioner).
- [Code]
- Further paper
- End-to-end optimization of goal-driven and visually grounded dialogue systems, Reinforcement Learning applied to GuessWhat?!
- Guessing State Tracking for Visual Dialogue ECCV2020
- [Language-Conditioned Feature Pyramids for Visual Selection Tasks] EMNLP2020 [Code]
- Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat NAACL2019
- Interactive Classification by Asking Informative Questions [Code] ACL2020
(?) Others
- Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts ACL2020
- How2: A Large-scale Dataset for Multimodal Language Understanding [Data] NIPS2018
(8) [Image captioning] generating a natural language description of an image
- MS COCO dataset 2014, images + captions (five human-written sentence captions per image)
- Further papers
- Featurizing images as a whole / by regions (early approaches):
- Attention-based approaches (a top-down attention sketch follows this list):
- Bottom-up and top-down attention for image captioning and visual question answering in CVPR2018, winner of the 2017 Visual Question Answering challenge
- Show, attend and tell: Neural image caption generation with visual attention in ICML2015
- Review networks for caption generation NIPS2016
- Image captioning with semantic attention CVPR2016
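The attention in these captioning models is typically additive, in the Bottom-Up/Top-Down style: the decoder's hidden state scores each region feature, and the softmax-weighted sum becomes the visual context for predicting the next word. A hedged sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Additive attention over region features (Up-Down-style sketch)."""
    def __init__(self, feat_dim=2048, hid_dim=512, att_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, region_feats, dec_hidden):
        # region_feats: (batch, num_regions, feat_dim), dec_hidden: (batch, hid_dim)
        e = self.score(torch.tanh(self.feat_proj(region_feats)
                                  + self.hid_proj(dec_hidden).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)               # (B, num_regions, 1) weights
        return (alpha * region_feats).sum(dim=1)  # attended visual context
```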
- Graph structured approaches :
- Reinforcement learning:
- Transformer based:
- Image captioning: transform objects into words NIPS2019, uses Transformers that focus on objects and their spatial relationships
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning in ACL2018, also a dataset
- Improving Image Captioning with Better Use of Caption ACL2020
- Improving Image Captioning Evaluation by Considering Inter References Variance ACL2020
(9) Navigation task
- Talk the Walk: Navigating New York City through grounded dialogue
- [A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses] EMNLP2020
- navigating
- [Improving Vision-and-Language Navigation with Image-Text Pairs from the Web] ECCV2020
- [Diagnosing Vision-and-Language Navigation: What Really Matters] arXiv2021
- [Vision-Dialog Navigation by Exploring Cross-Modal Memory] CVPR2020
- [Vision-and-Dialog Navigation] CoRL2019
- [Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments] CVPR2018
- [Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters] 2020
- [Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation] ACL2019
- [Active Visual Information Gathering for Vision-Language Navigation] ECCV2020
- [Environment-agnostic Multitask Learning for Natural Language Grounded Navigation] ECCV2020
- [Perceive, Transform, and Act: Multi-Modal Attention Networks for Vision-and-Language Navigation] 2019
- [Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation] CVPR2019
- [Engaging Image Chat: Modeling Personality in Grounded Dialogue] 2018
- [TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments] CVPR2019
- [Multi-modal Discriminative Model for Vision-and-Language Navigation] 2019
- [REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments] CVPR 2020
- [Learning To Follow Directions in Street View] AAAI2020
- [Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning] ViGil@NeurIPS2019
- representation learning
- Grounding
- Words Aren't Enough, Their Order Matters: On the Robustness of Grounding Visual Referring Expressions ACL2020
- Grounding Conversations with Improvised Dialogues ACL2020
- A negative case analysis of visual grounding methods for VQA ACL2020
- Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms ACL2020
- Where Are You? Localization from Embodied Dialog [Code] EMNLP2020
- Visual Referring Expression Recognition: What Do Systems Actually Learn? NAACL2018
- Ask No More: Deciding when to guess in referential visual dialogue Coling2018
- Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts EMNLP2020
- Achieving Common Ground in Multi-modal Dialogue ACL2020
(10) retrieval task
- image retrieval/visual retrieval
- Exploring Phrase Grounding without Training: Contextualisation and Extension to Text-Based Image Retrieval CVPRW2020
- Toward General Scene Graph: Integration of Visual Semantic Knowledge with Entity Synset Alignment [Code wow finally a code for graph] ALVR2020
- Dialog-based Interactive Image Retrieval [Code, fashion retrieval] NeurIPS2018
- I Want This Product but Different: Multimodal Retrieval with Synthetic Query Expansion 2021
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers 2021
(11) image editing / text-to-image
- [Sequential Attention GAN for Interactive Image Editing] ACM2020
- [Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction] ICCV2019
- [ChatPainter: Improving Text to Image Generation using Dialogue] ICLR2018
- [Adversarial Text-to-Image Synthesis: A Review] 2021
- [A Multimodal Dialogue System for Conversational Image Editing] 2020
(12) Fashion 🌟🌟🌟
- SIMMC: domains include furniture and fashion 🌟🌟🌟; it can be seen as a variant of the MultiWOZ or Schema-Guided Dialogue datasets
- Situated and Interactive Multimodal Conversations [SIMMC 1.0] Coling2020, [SIMMC 2.0], tracks in DSTC9 and DSTC10
- [Code]
- Further papers
- A Response Retrieval Approach for Dialogue Using a Multi-Attentive Transformer, runner-up in the DSTC9 SIMMC fashion track, [code]
- Overview of the Ninth Dialog System Technology Challenge: DSTC9, for a better view of the winners' models
- [Code winner1 TNU] (a bit messy), [Code winner2 SU], [Code other]
- Fashion IQ in CVPR2020 workshop, [paper] [dataset & starter kit]
- MMD: Towards Building Large Scale Multimodal Domain-Aware Conversation Systems arXiv2017, [code], [Multimodal Dialogs (MMD): A large-scale dataset for studying multimodal domain-aware conversations] 2017
(13) video
- Audio Visual Scene-Aware Dialog Track in DSTC8 [Paper](https://ieeexplore.ieee.org/document/8953254) [site](https://video-dialog.com/)
- [CMU Sinbad’s Submission for the DSTC7 AVSD Challenge]
- [DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator] 2020
- [A Simple Baseline for Audio-Visual Scene-Aware Dialog] CVPR2019
- [TVQA] [MovieQA] [TGif-QA]
- TVQA+: Spatio-Temporal Grounding for Video Question Answering ACL2020
- [MultiSubs: A Large-scale Multimodal and Multilingual Dataset] 2021
- [Adversarial Multimodal Network for Movie Question Answering] 2019
- [What Makes Training Multi-Modal Classification Networks Hard?] CVPR2020
- DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue 2021
- Minecraft
- video & QA/Dialog papers
- representation learning
- VideoBERT: A Joint Model for Video and Language Representation Learning
- Learning Question-Guided Video Representation for Multi-Turn Video Question Answering ViGil@NeurIPS2019
- Video Dialog via Progressive Inference and Cross-Transformer EMNLP2019
- [Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems] ACL2019
- [Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog] 2020
- Video-Grounded Dialogues with Pretrained Generation Language Models ACL2020
- Graph
- Location-Aware Graph Convolutional Networks for Video Question Answering
- Object Relational Graph With Teacher-Recommended Learning for Video Captioning
- Fusion
- End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features IEEE2019
- [See the Sound, Hear the Pixels] IEEE2020
- [Video Dialog via Multi-Grained Convolutional Self-Attention Context Networks] SIGIR2019
- [Video Dialog via Multi-Grained Convolutional Self-Attention Context Multi-Modal Networks] IEEE2020
- [Game-Based Video-Context Dialogue] EMNLP2018
- [Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks] IEEE2019
- [End-to-End Multimodal Dialog Systems with Hierarchical Multimodal Attention on Video Features] 2018
(14) LEAF-QA: Locate, Encode & Attend for Figure Question Answering
(15) MOD: Meme-incorporated Open Dialogue, WeChat conversations with memes/stickers, in Chinese.
- A Multimodal Memes Classification: A Survey and Open Research Issues
- [Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog] WWW2020
- [Learning to Respond with Your Favorite Stickers: A Framework of Unifying Multi-Modality and User Preference in Multi-Turn Dialog] 2020
- [The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes] NeurIPS2020
- Multimodal Research in Vision and Language: A Review of Current and Emerging Trends arXiv2020
- Transformers in Vision: A Survey arXiv2021
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods JAIR2020
- Multimodal-Dialogue-PaperList
- Awesome-Visual-Transformer
- Transformer-in-Vision
- awesome-multimodal-ml
- awesome-visual-question-answering
- awesome-vqa-latest
- awesome-visual-dialog
- Awesome-Scene-Graphs
- awesome-vln
- Tasks
- Visual Question Answering
- Visual Dialog
- Visual Commonsense Reasoning
- Image-Text Retrieval
- Referring Expression Comprehension
- Visual Entailment
- NL+V representation ==> multimodal pretraining
- Issues / topics:
- text and image bias
- VL or LV BERTology
- visual understanding / reasoning / object relation
- cross-modal text-image relations (focus on interaction)
- incorporating knowledge / common sense (focus on knowledge)
- Often used model elements (a region-feature extraction sketch follows this list):
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks 2015
- LSTM
- GANs
- Transformers
- Graphs: graph attention networks, GCN, memory graphs, ...
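Region features in most papers above come from a Faster R-CNN detector (typically a ResNet-101 variant trained on Visual Genome). A minimal sketch of the interface using torchvision's off-the-shelf detector, just to show the shape of the pipeline:

```python
import torch
import torchvision

# Off-the-shelf COCO-trained detector; the VQA literature typically uses a
# Visual Genome-trained ResNet-101 variant instead, but the interface is alike.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)   # dummy RGB image with values in [0, 1]
with torch.no_grad():
    (out,) = model([image])       # list of images in, list of dicts out

keep = out["scores"] > 0.5        # keep confident detections
print(out["boxes"][keep].shape)   # (num_regions, 4) candidate region boxes
```

The papers then ROI-pool a fixed-size feature vector for each kept box; those vectors are the "region features" consumed by the models listed above.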
- often mentioned approaches:
- adversarial training
- reinforcement learning
- graph neural network
- joint learning / parallel / dual encoder / dual attention
- my questions
- what does "adaptive" mean? why everyone likes this specific word?
- "ground", mysterious word too...
- often can't find many codes for papers with "graph" or "reinforcement learning" in title ???