diff --git a/README.md b/README.md
index bfe196e..abf7271 100644
--- a/README.md
+++ b/README.md
@@ -4,20 +4,18 @@
 [![CI](https://github.com/souradipp76/MM-PoE/actions/workflows/main.yml/badge.svg)](https://github.com/souradipp76/MM-PoE/actions/workflows/main.yml)
-**Multiple Choice Reasoning via. Process of Elimination using Multi-Modal models**
+**Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models**
 ## What is MM-PoE?
 Multi-Modal Process of Elimination (MM-PoE) is a method to enhance vision language models' performance on multiple-choice visual reasoning by employing a two-step scoring system that first eliminates incorrect options and then predicts from the remaining ones. Our experiments across three question answering datasets show the method's effectiveness, particularly in visual reasoning tasks.
-**Statement of Need**
-
 Large Language models (LLMs) excel at in-context learning for multiple choice reasoning tasks but often treat all options equally, unlike humans who typically eliminate incorrect choices before selecting the correct answer. Same is true for vision language models (VLMs) in case of visual question answering tasks with multiple choices. This discrepancy can limit the effectiveness of vision language models in accurately solving such tasks. To address this, we introduce Multi-Modal Process of Elimination (MM-PoE), a two-step scoring method designed to enhance VLM performance by mimicking human reasoning strategies in multi-modal settings.
-In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios . Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).
+In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios. Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).
-By implementing MM-PoE, researchers and practitioners can experiment and significantly improve the accuracy and reliability of VLMs in multiple choice reasoning tasks, making it a valuable tool for advancing machine learning models for visual reasoning.
+Using this tool, researchers and practitioners can experiment and significantly improve the accuracy and reliability of VLMs in multiple choice reasoning tasks, making it a valuable tool for advancing machine learning models for visual reasoning.
 ## Installing MM-PoE
@@ -65,7 +63,10 @@ $ python -m mm_poe
 $ mm_poe
 ```
-The application will prompt the user to provide relevant inputs for a multiple choice question e.g a question, multiple answer choices for the question and the path to the image relevant the question context. Once the inputs are provided, the predicted answer will be displayed based on the selections. Note that this application runs inference for only a single sample at a time.
+The application will prompt the user to provide relevant inputs for a multiple choice question e.g. a question, multiple answer choices for the question and the path to the image relevant the question context. Once the inputs are provided, the predicted answer will be displayed based prompt outputs. Note that this application runs inference for only a single sample at a time.
+
+
+Example
 ### Running Experiments
diff --git a/mm_poe/cli.py b/mm_poe/cli.py
index 866701e..61cdaa0 100644
--- a/mm_poe/cli.py
+++ b/mm_poe/cli.py
@@ -65,7 +65,7 @@ def main():
     ).ask()
     args.loading_precision = questionary.select(
-        message="Select model checkpoint?",
+        message="Select model precision?",
         choices=["FP32", "FP16", "BF16", "INT8", "INT4"],
         default="FP32",
     ).ask()
@@ -116,7 +116,8 @@ def main():
         "Image Path?", default="./images/image.png"
     ).ask()
     args.label = questionary.select(
-        message="Answer:", choices=[str(x) for x in range(args.num_options)]
+        message="Ground Truth Option:",
+        choices=[str(x) for x in range(args.num_options)],
     ).ask()
     args.label = int(args.label)
     args.method = "process_of_elimination"
@@ -394,4 +395,4 @@ def main():
         )
     )
     option = int(lm_predictions.numpy()[0])
-    logger.info(f"Answer: {option}")
+    logger.info(f"Predicted Option: {option}. Answer: {args.choices[option]}")
diff --git a/mm_poe/data/data_downloaders.sh b/mm_poe/data/data_downloaders.sh
index 0d3f927..98ca44f 100644
--- a/mm_poe/data/data_downloaders.sh
+++ b/mm_poe/data/data_downloaders.sh
@@ -80,6 +80,9 @@ cd Annotations
 wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/Annotations_Train_mscoco.zip
 unzip Annotations_Train_mscoco.zip
 rm Annotations_Train_mscoco.zip
+wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/Annotations_Val_mscoco.zip
+unzip Annotations_Val_mscoco.zip
+rm Annotations_Val_mscoco.zip
 mkdir ../Questions
 cd ../Questions
 wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/Questions_Train_mscoco.zip
diff --git a/mm_poe/methods/utils/data.py b/mm_poe/methods/utils/data.py
index 033e113..e433d24 100644
--- a/mm_poe/methods/utils/data.py
+++ b/mm_poe/methods/utils/data.py
@@ -505,7 +505,6 @@ def preprocess_function_causal_vqa_channel(examples, **kwargs):
     tokenizer = processor.tokenizer
     image_processor = processor.image_processor
-    ending_names = [k for k in examples.keys() if k.startswith("hypothesis")]
     num_choice = len(ending_names)
     question_headers = examples[header_name]
     first_sentences = [
@@ -1263,24 +1262,24 @@ def vqa_loader(path, args):
     examples = []
     print("Loading annotations and questions...")
-    train_anno = json.load(open(ann_file, "r"))
-    train_ques = json.load(open(question_file, "r"))
+    anno = json.load(open(ann_file, "r"))
+    ques = json.load(open(question_file, "r"))
     if args.calibration_prompt is not None:
         uncond_premise = args.calibration_prompt
     else:
         uncond_premise = " the answer is:"
-    for i in range(len(train_anno["annotations"])):
-        ans = train_anno["annotations"][i]["multiple_choice_answer"]
-        img_id = train_anno["annotations"][i]["image_id"]
+    for i in range(len(anno["annotations"])):
+        ans = anno["annotations"][i]["multiple_choice_answer"]
+        img_id = anno["annotations"][i]["image_id"]
         # question_id = train_anno['annotations'][i]['question_id']
         image_path = os.path.join(
-            img_dir, "COCO_train2014_" + "%012d.jpg" % img_id
+            img_dir, "COCO_%s2014_" % args.split + "%012d.jpg" % img_id
         )
-        question = train_ques["questions"][i]["question"]
-        mc_ans = train_ques["questions"][i]["multiple_choices"]
+        question = ques["questions"][i]["question"]
+        mc_ans = ques["questions"][i]["multiple_choices"]
         label = mc_ans.index(ans)
         if getattr(args, "multiple_choice_prompt", None) is not None:
diff --git a/mm_poe/methods/utils/methods.py b/mm_poe/methods/utils/methods.py
index f745b62..7c3adf9 100644
--- a/mm_poe/methods/utils/methods.py
+++ b/mm_poe/methods/utils/methods.py
@@ -169,8 +169,8 @@ def inference_language_modeling(
             labels
         )
         pbar.set_description(
-            f"Language modeling accuracy: {lm_accuracy:.4f},\
-            Average language modeling accuracy: {avg_lm_accuracy:.4f}"
+            f"Language modeling accuracy: {lm_accuracy:.4f}, "
+            + f"Average language modeling accuracy: {avg_lm_accuracy:.4f}"
         )
     avg_log_probs = torch.cat(avg_log_probs, dim=0)
     return avg_log_probs, lm_accuracy, avg_lm_accuracy, lm_predictions
diff --git a/mm_poe/results/calibration1.csv b/mm_poe/results/calibration_old.csv
similarity index 100%
rename from mm_poe/results/calibration1.csv
rename to mm_poe/results/calibration_old.csv
diff --git a/mm_poe/results/channel1.csv b/mm_poe/results/channel_old.csv
similarity index 100%
rename from mm_poe/results/channel1.csv
rename to mm_poe/results/channel_old.csv
diff --git a/mm_poe/results/language_modeling.csv b/mm_poe/results/language_modeling.csv
new file mode 100644
index 0000000..ebc51d0
--- /dev/null
+++ b/mm_poe/results/language_modeling.csv
@@ -0,0 +1,45 @@
+model_family,checkpoint,loading_precision,dataset,batch_size,method,seed,n_shot,sample,accuracy
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,language_modeling,0,0,100,0.2500
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,language_modeling,1,0,100,0.2400
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,language_modeling,2,0,100,0.2700
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,language_modeling,3,0,100,0.2300
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,language_modeling,4,0,100,0.2800
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,language_modeling,0,0,100,0.2900
+GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,language_modeling,0,0,100,0.2600
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,language_modeling,0,0,100,0.2600
+GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,language_modeling,1,0,100,0.1300
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,language_modeling,1,0,100,0.2300
+GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,language_modeling,2,0,100,0.2500
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,language_modeling,2,0,100,0.2600
+GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,language_modeling,3,0,100,0.2500
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,language_modeling,3,0,100,0.2400
+GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,language_modeling,4,0,100,0.2000
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,language_modeling,4,0,100,0.3100
+GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,language_modeling,0,0,100,0.3000
+GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,language_modeling,1,0,100,0.2400
+GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,language_modeling,2,0,100,0.2800
+GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,language_modeling,3,0,100,0.3100
+GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,language_modeling,4,0,100,0.2400
+GIT,microsoft/git-base-vqav2,FP32,vqa,2,language_modeling,0,0,100,0.6000
+GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,average_language_modeling,0,0,100,0.1600
+GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,average_language_modeling,1,0,100,0.2000
+GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,average_language_modeling,2,0,100,0.2200
+GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,average_language_modeling,3,0,100,0.1800
+GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,average_language_modeling,4,0,100,0.1300
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,average_language_modeling,0,0,100,0.3000
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,average_language_modeling,1,0,100,0.2600
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,average_language_modeling,2,0,100,0.2200
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,average_language_modeling,3,0,100,0.2600
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,average_language_modeling,4,0,100,0.2700
+GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,average_language_modeling,0,0,100,0.1900
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,average_language_modeling,0,0,100,0.3100
+GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,average_language_modeling,1,0,100,0.2000
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,average_language_modeling,1,0,100,0.2800
+GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,average_language_modeling,2,0,100,0.2100
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,average_language_modeling,2,0,100,0.2300
+GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,average_language_modeling,3,0,100,0.2200
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,average_language_modeling,3,0,100,0.2800
+GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,average_language_modeling,4,0,100,0.2000
+GIT,microsoft/git-base-textvqa,FP32,ai2d,2,average_language_modeling,4,0,100,0.2800
+GIT,microsoft/git-base-textvqa,FP32,vqa,2,language_modeling,0,0,100,0.1900
+GIT,microsoft/git-base-textvqa,FP32,vqa,2,average_language_modeling,0,0,100,0.1800
diff --git a/mm_poe/results/language_modeling_old.csv b/mm_poe/results/language_modeling_old.csv
new file mode 100644
index 0000000..feac5ed
--- /dev/null
+++ b/mm_poe/results/language_modeling_old.csv
@@ -0,0 +1,12 @@
+model_family,checkpoint,loading_precision,dataset,batch_size,method,seed,n_shot,sample,accuracy
+BLIP2,Salesforce/blip2-opt-2.7b,INT8,vqa,16,language_modeling,0,0,100,0.0600
+BLIP2,Salesforce/blip2-opt-2.7b,INT8,vqa,16,language_modeling,0,0,100,0.0600
+BLIP2,Salesforce/blip2-opt-2.7b,INT4,vqa,4,language_modeling,0,0,100,0.4300
+BLIP2,Salesforce/blip2-opt-2.7b,FP16,scienceqa,2,language_modeling,0,0,100,0.3800
+BLIP2,Salesforce/blip2-opt-2.7b,FP16,scienceqa,2,language_modeling,0,0,100,0.3800
+BLIP2,Salesforce/blip2-opt-2.7b,BF16,scienceqa,2,language_modeling,0,0,100,0.2200
+BLIP2,Salesforce/blip2-flan-t5-xl,FP16,scienceqa,2,language_modeling,0,0,100,0.3600
+BLIP2,Salesforce/blip2-opt-2.7b,FP16,ai2d,2,language_modeling,0,0,100,0.2300
+PaliGemma,google/paligemma-3b-ft-ai2d-448,FP16,ai2d,2,language_modeling,0,0,100,0.2100
+PaliGemma,google/paligemma-3b-ft-ai2d-448,FP16,ai2d,2,language_modeling,0,0,100,0.2100
+GIT,microsoft/git-base-vqav2,FP32,ai2d,2,language_modeling,0,0,100,0.2100
diff --git a/mm_poe/results/multiple_choice_prompt1.csv b/mm_poe/results/multiple_choice_prompt_old.csv
similarity index 100%
rename from mm_poe/results/multiple_choice_prompt1.csv
rename to mm_poe/results/multiple_choice_prompt_old.csv
diff --git a/mm_poe/results/process_of_elimination1.csv b/mm_poe/results/process_of_elimination_old.csv
similarity index 100%
rename from mm_poe/results/process_of_elimination1.csv
rename to mm_poe/results/process_of_elimination_old.csv
diff --git a/mm_poe/results/vision_language_modeling.csv b/mm_poe/results/vision_language_modeling.csv
deleted file mode 100644
index 8dea0a7..0000000
--- a/mm_poe/results/vision_language_modeling.csv
+++ /dev/null
@@ -1,23 +0,0 @@
-model_family,checkpoint,loading_precision,dataset,batch_size,method,seed,n_shot,sample,accuracy
-GIT,microsoft/git-base-vqav2,FP32,ai2d,2,vision_language_modeling,0,0,100,0.2500
-GIT,microsoft/git-base-vqav2,FP32,ai2d,2,vision_language_modeling,1,0,100,0.2400
-GIT,microsoft/git-base-vqav2,FP32,ai2d,2,vision_language_modeling,2,0,100,0.2700
-GIT,microsoft/git-base-vqav2,FP32,ai2d,2,vision_language_modeling,3,0,100,0.2300
-GIT,microsoft/git-base-vqav2,FP32,ai2d,2,vision_language_modeling,4,0,100,0.2800
-GIT,microsoft/git-base-textvqa,FP32,ai2d,2,vision_language_modeling,0,0,100,0.2900
-GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,vision_language_modeling,0,0,100,0.2600
-GIT,microsoft/git-base-textvqa,FP32,ai2d,2,vision_language_modeling,0,0,100,0.2600
-GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,vision_language_modeling,1,0,100,0.1300
-GIT,microsoft/git-base-textvqa,FP32,ai2d,2,vision_language_modeling,1,0,100,0.2300
-GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,vision_language_modeling,2,0,100,0.2500
-GIT,microsoft/git-base-textvqa,FP32,ai2d,2,vision_language_modeling,2,0,100,0.2600
-GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,vision_language_modeling,3,0,100,0.2500
-GIT,microsoft/git-base-textvqa,FP32,ai2d,2,vision_language_modeling,3,0,100,0.2400
-GIT,microsoft/git-base-textvqa,FP32,scienceqa,2,vision_language_modeling,4,0,100,0.2000
-GIT,microsoft/git-base-textvqa,FP32,ai2d,2,vision_language_modeling,4,0,100,0.3100
-GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,vision_language_modeling,0,0,100,0.3000
-GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,vision_language_modeling,1,0,100,0.2400
-GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,vision_language_modeling,2,0,100,0.2800
-GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,vision_language_modeling,3,0,100,0.3100
-GIT,microsoft/git-base-vqav2,FP32,scienceqa,2,vision_language_modeling,4,0,100,0.2400
-GIT,microsoft/git-base-vqav2,FP32,vqa,2,vision_language_modeling,0,0,100,0.6000
diff --git a/mm_poe/results/vision_language_modeling1.csv b/mm_poe/results/vision_language_modeling1.csv
deleted file mode 100644
index 16757b4..0000000
--- a/mm_poe/results/vision_language_modeling1.csv
+++ /dev/null
@@ -1,12 +0,0 @@
-model_family,checkpoint,loading_precision,dataset,batch_size,method,seed,n_shot,sample,accuracy
-BLIP2,Salesforce/blip2-opt-2.7b,INT8,vqa,16,vision_language_modeling,0,0,100,0.0600
-BLIP2,Salesforce/blip2-opt-2.7b,INT8,vqa,16,vision_language_modeling,0,0,100,0.0600
-BLIP2,Salesforce/blip2-opt-2.7b,INT4,vqa,4,vision_language_modeling,0,0,100,0.4300
-BLIP2,Salesforce/blip2-opt-2.7b,FP16,scienceqa,2,vision_language_modeling,0,0,100,0.3800
-BLIP2,Salesforce/blip2-opt-2.7b,FP16,scienceqa,2,vision_language_modeling,0,0,100,0.3800
-BLIP2,Salesforce/blip2-opt-2.7b,BF16,scienceqa,2,vision_language_modeling,0,0,100,0.2200
-BLIP2,Salesforce/blip2-flan-t5-xl,FP16,scienceqa,2,vision_language_modeling,0,0,100,0.3600
-BLIP2,Salesforce/blip2-opt-2.7b,FP16,ai2d,2,vision_language_modeling,0,0,100,0.2300
-PaliGemma,google/paligemma-3b-ft-ai2d-448,FP16,ai2d,2,vision_language_modeling,0,0,100,0.2100
-PaliGemma,google/paligemma-3b-ft-ai2d-448,FP16,ai2d,2,vision_language_modeling,0,0,100,0.2100
-GIT,microsoft/git-base-vqav2,FP32,ai2d,2,vision_language_modeling,0,0,100,0.2100
diff --git a/paper/figures/17.png b/paper/figures/17.png
new file mode 100644
index 0000000..c15a66a
Binary files /dev/null and b/paper/figures/17.png differ
diff --git a/paper/figures/cli.png b/paper/figures/cli.png
new file mode 100644
index 0000000..aa842db
Binary files /dev/null and b/paper/figures/cli.png differ
diff --git a/paper/paper.bib b/paper/paper.bib
index 11309e5..08a8207 100644
--- a/paper/paper.bib
+++ b/paper/paper.bib
@@ -285,3 +285,26 @@ @conj{Idefics2
   version = {8b},
   howpublished = {\url{https://huggingface.co/HuggingFaceM4/idefics2-8b}}
 }
+
+@InProceedings{VQA,
+author = {Stanislaw Antol and Aishwarya Agrawal and Jiasen Lu and Margaret Mitchell and Dhruv Batra and C. Lawrence Zitnick and Devi Parikh},
+title = {VQA: Visual Question Answering},
+booktitle = {International Conference on Computer Vision (ICCV)},
+year = {2015},
+}
+
+@article{Kembhavi2016ADI,
+  title={A Diagram is Worth a Dozen Images},
+  author={Aniruddha Kembhavi and Michael Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
+  journal={ArXiv},
+  year={2016},
+  volume={abs/1603.07396},
+  url={https://api.semanticscholar.org/CorpusID:2682274}
+}
+
+@inproceedings{lu2022learn,
+  title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
+  author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Ashwin Kalyan},
+  booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)},
+  year={2022}
+}
\ No newline at end of file
diff --git a/paper/paper.md b/paper/paper.md
index d8ea868..6dcb80c 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -24,15 +24,15 @@ bibliography: paper.bib
 # Summary
-This paper introduces Multiple Choice Reasoning via. Process of Elimination using Multi-Modal models, also know as Multi-Modal Process of Elimination (MM-PoE), a method to enhance vision language models' performance on multiple-choice visual reasoning by employing a two-step scoring system that first eliminates incorrect options and then predicts from the remaining ones. Our experiments across three question answering datasets show the method's effectiveness, particularly in visual reasoning tasks. This method addresses one of the main limitations of the paper [@ma2023poe] by extending to tasks involving multi-modalities and also includes experimentation techniques for few-shot settings.
+This paper introduces Multiple Choice Reasoning via. Process of Elimination using Multi-Modal models, also know as Multi-Modal Process of Elimination (MM-PoE), a method to enhance vision language models' performance on multiple-choice visual reasoning tasks by employing a two-step scoring system that first eliminates incorrect options and then predicts from the remaining ones. Our experiments across three question answering datasets show the method's effectiveness, particularly in visual reasoning tasks. This method addresses one of the key limitations of the paper [@ma2023poe] by extending to tasks involving multi-modalities and also includes experimentation techniques for few-shot settings.
 # Statement of Need
 Large Language models (LLMs) excel at in-context learning for multiple choice reasoning tasks but often treat all options equally, unlike humans who typically eliminate incorrect choices before selecting the correct answer. Same is true for vision language models (VLMs) in case of visual question answering tasks with multiple choices. This discrepancy can limit the effectiveness of vision language models in accurately solving such tasks. To address this, we introduce Multi-Modal Process of Elimination (MM-PoE), a two-step scoring method designed to enhance VLM performance by mimicking human reasoning strategies in multi-modal settings.
-In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios . Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).
+In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios. Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).
-By implementing MM-PoE, researchers and practitioners can experiment and significantly improve the accuracy and reliability of VLMs in multiple choice reasoning tasks, making it a valuable tool for advancing machine learning models for visual reasoning.
+Using this tool, researchers and practitioners can experiment and significantly improve the accuracy and reliability of VLMs in multiple choice reasoning tasks, making it a valuable tool for advancing machine learning models for visual reasoning.
 # State of the Field
@@ -106,13 +106,13 @@ To further explore the versatility of MM-PoE, we also examined its performance i
 ## Data
-Our experiments were conducted on three different multiple-choice visual reasoning datasets, selected to cover a broad spectrum of reasoning types and complexities. These tasks include both traditional reasoning tasks and more specialized ones designed to test specific reasoning skills. To ensure a comprehensive evaluation, we used train sets from established benchmarks when available; otherwise, we utilized development sets.
+Our experiments were conducted on three different multiple-choice visual reasoning datasets - Visual Question Answering(VQA) [@VQA], ScienceQA [@lu2022learn] and Diagram Understanding(AI2D) [@Kembhavi2016ADI], selected to cover a broad spectrum of reasoning types and complexities. These tasks include both traditional visual reasoning tasks and more specialized ones designed to test specific reasoning skills. To ensure a comprehensive evaluation, we used train sets from established benchmarks when available; otherwise, we utilized development sets. In case of varying number of options in the multiple-choice answers for SceinceQA and AI2D datasets, we filtered questions containing image context and exactly four options.
 | Dataset | #Options | Train | Dev | Test |
 |----|------|------|------|-----------|
-|VQA v1.0| 18 | 248,349 | 121,512 | 244,302 |
-|ScienceQA | 4 | 2221 | | |
-| AI2D | 4 | | | |
+| VQA | 18 | 248,349 | 121,512 | 244,302 |
+| ScienceQA | 4 | 12726 | 4241 | 4241 |
+| AI2D | 4 | 3921 | 982 | - |
 ## Model
@@ -145,22 +145,37 @@ MM-PoE consistently outperformed or matched the best-performing baselines across
 | Model | Dataset | LM | AVG | Calibration | Channel | MCP | PoE |
 |----|------|------|------|-----------|---|---|---|
-|microsoft/git-base-vqav2| VQA | | | | | | | |
-|microsoft/git-base-vqav2| ScienceQA | 27.4 | | 23.2| 24.6 | 25.8 | 27.2 |
-|microsoft/git-base-vqav2| AI2D | 25.4| | 26.4| 25.4 | 25.3 | 26.5 |
-|microsoft/git-base-textvqa| VQA | | | | | | |
-|microsoft/git-base-textvqa| ScienceQA | 21.8| | 25.8 | 23.4 | 23.6 | 28.2 |
-|microsoft/git-base-textvqa| AI2D | 26.5 | | 20.8| 26.2 | 24.2| 26.8 |
+|microsoft/git-base-vqav2| VQA | 45 | 43 | 38 | | | | |
+|microsoft/git-base-vqav2| ScienceQA | 27.4 | 17.8 | 23.2| 24.6 | 25.8 | 27.2 |
+|microsoft/git-base-vqav2| AI2D | 25.4| 26.2 | 26.4| 25.4 | 25.3 | 26.5 |
+|microsoft/git-base-textvqa| VQA | 18.5 | 17 | | | | |
+|microsoft/git-base-textvqa| ScienceQA | 21.8 | 20.4 | 25.8 | 23.4 | 23.6 | 28.2 |
+|microsoft/git-base-textvqa| AI2D | 26.5 | 27.6 | 20.8| 26.2 | 24.2| 26.8 |
 **Table 1**: Comparison of Multiple-Choice Prompting (MCP) and Process of Elimination (PoE) accuracy scores on 3 visual question answering datasets for the `microsoft/git-base-vqav2` and `microsoft/git-base-textvqa` models in the zero-shot settings. Each dataset has different number of answer choices. PoE largely outperforms MCP on all the visual reasoning tasks for the two multi-modal models mentioned.
-## Example
+## Examples
-Example
+### ScienceQA Example
+Example
 **Question**: Which of these states is farthest north?
-**Choices**: West Virginia, Louisiana, Arizona, Oklahoma
-**Predicted**: 0
+**Options**: West Virginia, Louisiana, Arizona, Oklahoma
+**Ground Truth Option**: West Virginia
+
+**Predicted Masks**: West Virginia, Louisiana, [MASK], [MASK]
+**Predicted Option**: West Virginia
+
+### AI2D Example
+
+Example
+
+**Question**: Are phytoplankton predators or prey in this food chain?
+**Options**: producer, predator, prey, NA
+**Ground Truth Option**: prey
+
+**Predicted Masks**: [MASK], predator, prey, NA
+**Predicted Option**: prey
 # Conclusion