by Pooyan Rahmanzadehgervi<sup>1,*</sup>, Logan Bolton<sup>1,*</sup>, Mohammad Reza Taesiri<sup>2</sup>, Anh Totti Nguyen<sup>1</sup>

<sup>*</sup>Equal contribution

<sup>1</sup>Auburn University, <sup>2</sup>University of Alberta
This repository contains the code and data for the paper *Vision Language Models Are Blind*.
```bibtex
@article{vlms2024blind,
  title={Vision language models are blind},
  author={Rahmanzadehgervi, Pooyan and Bolton, Logan and Taesiri, Mohammad Reza and Nguyen, Anh Totti},
  journal={arXiv preprint arXiv:2407.06581},
  year={2024}
}
```
While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they surprisingly still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.12% accurate on average. Claude 3.5 Sonnet performs the best at 74.94% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and recognizing geometric primitives that overlap or are close together. Code and data are available at: https://vlmsareblind.github.io
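For intuition about how simple the BlindTest stimuli are, here is a rough illustrative sketch (our own, not the paper's actual generation code; the function name, sizes, and output path are placeholders) that draws the two-circle task: the circles overlap exactly when the distance between their centers is less than the sum of their radii.

```python
# Illustrative sketch of a BlindTest-style two-circle stimulus.
# NOT the paper's generation code; all parameters are placeholders.
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def draw_two_circles(distance, radius=1.0, linewidth=2, path="two_circles.png"):
    """Draw two circles whose centers are `distance` apart.

    With equal radii, the circles overlap when distance < 2 * radius.
    """
    fig, ax = plt.subplots(figsize=(4, 4), dpi=100)
    ax.add_patch(Circle((-distance / 2, 0), radius, fill=False, lw=linewidth))
    ax.add_patch(Circle((distance / 2, 0), radius, fill=False, lw=linewidth))
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)
    ax.set_aspect("equal")
    ax.axis("off")  # no axes: the model sees only the primitives
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

draw_two_circles(distance=1.5)  # overlapping, since 1.5 < 2 * 1.0
```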
To reproduce our evaluation:

- Find images in the `src/{task}` directory. For example: this image in the `gpt-4o/incorrect` folder.
- Locate the corresponding prompts in `prompts.md`. For example: "Are the two circles touching each other? Answer with Yes/No."
- Input the above image and prompt to the models via the default API settings or the official playground, NOT their web interface (e.g., use https://platform.openai.com/playground/chat for GPT-4o); a minimal API-call sketch follows this list.
- Compare your results with our paper, noting that variations may occur due to the default `temperature = 1` setting.
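As a minimal sketch of the API-based step above, assuming the OpenAI Python SDK (`pip install openai`) and an `OPENAI_API_KEY` in the environment; the image path below is a placeholder, not a file guaranteed to exist in this repository:

```python
# Minimal sketch: query GPT-4o via the API (not the web interface)
# with a BlindTest image and prompt at the default temperature of 1.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder path; substitute an actual image from src/{task}.
with open("src/touching_circles/example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=1,  # default setting used in the paper
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Are the two circles touching each other? Answer with Yes/No."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```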
Important: Using the models' web interfaces (e.g., chatgpt.com) may yield results very different from those reported in our paper.