by Pooyan Rahmanzadehgervi<sup>1,*</sup>, Logan Bolton<sup>1,*</sup>, Mohammad Reza Taesiri<sup>2</sup>, Anh Totti Nguyen<sup>1</sup>

<sup>*</sup>Equal contribution

<sup>1</sup>Auburn University, <sup>2</sup>University of Alberta
This repository contains the code and data for the paper *Vision Language Models Are Blind*.
```bibtex
@article{vlms2024blind,
  title={Vision language models are blind},
  author={Rahmanzadehgervi, Pooyan and Bolton, Logan and Taesiri, Mohammad Reza and Nguyen, Anh Totti},
  journal={arXiv preprint arXiv:2407.06581},
  year={2024}
}
```
While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they surprisingly still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.12% accurate on average. Claude 3.5 Sonnet performs the best at 74.94% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and recognizing geometric primitives that overlap or are close together. Code and data are available at: https://vlmsareblind.github.io
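For intuition about how simple the BlindTest stimuli are, here is a rough illustrative sketch (our own, not the paper's actual generation code; the function name, sizes, and output path are placeholders) that draws the two-circle task: the circles overlap exactly when the distance between their centers is less than the sum of their radii.

```python
# Illustrative sketch of a BlindTest-style two-circle stimulus.
# NOT the paper's generation code; all parameters are placeholders.
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def draw_two_circles(distance, radius=1.0, linewidth=2, path="two_circles.png"):
    """Draw two circles whose centers are `distance` apart.

    With equal radii, the circles overlap when distance < 2 * radius.
    """
    fig, ax = plt.subplots(figsize=(4, 4), dpi=100)
    ax.add_patch(Circle((-distance / 2, 0), radius, fill=False, lw=linewidth))
    ax.add_patch(Circle((distance / 2, 0), radius, fill=False, lw=linewidth))
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)
    ax.set_aspect("equal")
    ax.axis("off")  # no axes: the model sees only the primitives
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

draw_two_circles(distance=1.5)  # overlapping, since 1.5 < 2 * 1.0
```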
To reproduce our evaluation:

- Find images in the `src/{task}` directory. For example: this image in the `gpt-4o/incorrect` folder.
- Locate the corresponding prompts in `prompts.md`. For example: "Are the two circles touching each other? Answer with Yes/No."
- Input the above image and prompt to the models via the default API settings or the official playground, NOT their web interface (e.g., use https://platform.openai.com/playground/chat for GPT-4o); a minimal API-call sketch follows this list.
- Compare your results with our paper, noting that variations may occur due to the default `temperature = 1` setting.
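As a minimal sketch of the API-based step above, assuming the OpenAI Python SDK (`pip install openai`) and an `OPENAI_API_KEY` in the environment; the image path below is a placeholder, not a file guaranteed to exist in this repository:

```python
# Minimal sketch: query GPT-4o via the API (not the web interface)
# with a BlindTest image and prompt at the default temperature of 1.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder path; substitute an actual image from src/{task}.
with open("src/touching_circles/example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=1,  # default setting used in the paper
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Are the two circles touching each other? Answer with Yes/No."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```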
Important: Using the models' web interfaces (e.g., chatgpt.com) may yield results very different from those reported in our paper.