### Vision language models are blind


\*Equal contribution

¹Auburn University, ²University of Alberta

## Abstract

Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro, are powering countless image-text processing applications and scoring high on existing vision-understanding benchmarks. Yet, we find that VLMs fail on 7 visual tasks *absurdly easy* to humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting the number of circles in an Olympic-like logo.

The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry, and at worst, like an intelligent person who is blind making educated guesses.

## Task 1: Counting line intersections

Given the impressive accuracy of VLMs at answering questions about diagrams and charts (e.g., Sonnet-3.5 scoring 94.7% on AI2D and 90.8% on ChartQA) [1], a reasonable hypothesis is that VLMs must be able to see whether two graphs intersect in a chart. Here, we test this hypothesis by asking VLMs to count the number of intersections between two 2-segment piecewise linear functions.

### Images

We create 150 images (see Figure 1) of 2D line plots drawn on a white canvas. Each line plot consists of two line segments, defined by three points whose x-coordinates are fixed and equally spaced. The y-coordinates are randomly sampled to create two plots that intersect at exactly 0, 1, or 2 points. See Appendix A for more details.
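Because the y-coordinates are sampled randomly, the two functions almost never tie exactly at a shared x-knot, so the ground-truth intersection count reduces to counting sign changes of their difference across the three knots. A minimal sketch under that generic-position assumption (the function name is ours, not the authors' code):

```python
def count_intersections(ys1, ys2):
    """Count crossings of two piecewise-linear functions that share
    the same fixed, equally spaced x-knots. Each sign change of the
    difference between consecutive knots means exactly one crossing
    inside that segment. Assumes no exact ties at the knots."""
    diffs = [a - b for a, b in zip(ys1, ys2)]
    return sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)
```

For example, a tent-shaped plot against a flat one crosses twice, while two parallel ramps never cross.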

### Prompts

We ask each question using two different wordings:

- *“How many times do the blue and red line plots cross each other?”*
- *“How many times do the blue and red lines intersect?”*

### Groundtruth

Answers are ∈ {0, 1, 2} (random-baseline accuracy: 33%).

## Results

The following table shows the performance of the four models on the task of counting line intersections.

| Thickness | GPT-4o | Gemini-1.5 Pro | Sonnet-3 | Sonnet-3.5 |
|---|---|---|---|---|
| 2 | 45.00 | 70.00 | 64.00 | 80.00 |
| 3 | 47.00 | 68.00 | 66.00 | 79.00 |
| 4 | 54.00 | 71.00 | 62.00 | 73.00 |
| **Average** | 48.67 | 69.67 | 64.00 | 77.33 |

## Qualitative samples

*(Figure: qualitative samples from GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5.)*

## Task 2: Two circles

In contrast to Task 1, where we tested VLMs on thin lines, here we evaluate their ability to perceive interactions between larger objects, specifically two same-sized filled circles. This task assesses VLMs’ capability to detect (1) small gaps between circles and (2) overlapping circles.

### Images

We generate 672 images of two circles on a white canvas. The circles vary in size, distance, and orientation:

- Circle diameters: 1/4, 1/5, 1/6, or 1/7 of the canvas size
- Distances between circle perimeters: -0.15 to 0.5 times the diameter
- Orientations: 90°, 0°, -45°, and 45° angles with the x-axis
- Canvas sizes: 384, 769, and 1155 pixels
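Given these parameters, placing the pair of circles reduces to simple trigonometry on the line joining their centers. A sketch of the geometry (the function name and the choice to center the pair on the canvas are our assumptions):

```python
import math

def circle_centers(canvas, diameter_frac, dist_frac, angle_deg):
    """Place two same-sized circles on a square canvas.

    dist_frac is the gap between perimeters as a fraction of the
    diameter (negative => overlap); angle_deg orients the line
    joining the centers. Illustrative, not the authors' exact code."""
    d = canvas * diameter_frac              # circle diameter in px
    center_dist = d + dist_frac * d         # perimeter gap + one diameter
    dx = center_dist / 2 * math.cos(math.radians(angle_deg))
    dy = center_dist / 2 * math.sin(math.radians(angle_deg))
    cx = cy = canvas / 2                    # midpoint of the pair
    return (cx - dx, cy - dy), (cx + dx, cy + dy)
```

With `dist_frac = 0` the perimeters just touch; with `dist_frac = -0.15` they overlap by 15% of a diameter.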

### Prompts

We ask each question using two different wordings:

- *“Are the two circles touching each other? Answer with Yes/No.”*
- *“Are the two circles overlapping? Answer with Yes/No.”*

### Groundtruth

Answers are based on the distance d between circle perimeters:

- d < 0: Overlapping and touching
- d = 0: Non-overlapping but touching
- d > 0: Non-overlapping and non-touching

Random-baseline accuracy: 50%.
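The three cases above map onto the two Yes/No questions as follows; a minimal sketch of the labeling rule:

```python
def labels(d):
    """Ground-truth answers for the two prompts, given the distance d
    between the circle perimeters (negative when they overlap).
    Returns (overlapping, touching)."""
    overlapping = d < 0
    touching = d <= 0   # overlapping circles also count as touching
    return overlapping, touching
```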

## Results

The following table shows the performance of the four models on the two-circle task.

| Question | GPT-4o | Gemini-1.5 Pro | Sonnet-3 | Sonnet-3.5 |
|---|---|---|---|---|
| Overlapping | 71.27 | 93.30 | 88.09 | 88.83 |
| Touching | 74.10 | 92.26 | 80.95 | 94.49 |
| **Average** | 72.69 | 92.78 | 84.52 | 91.66 |

## Qualitative samples

*(Figure: qualitative samples from GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5; errors occur regardless of the actual distance between the two circles.)*

## Task 3: The circled letter

Consistent with prior reports [2][3][4], we find that VLMs can identify a primitive shape (e.g., a red circle ⭕) with 100% accuracy [2] and can perfectly read an English word (e.g., Subdermatoglyphic) alone. Here, we superimpose the red circle on every letter, one at a time, in the word, and ask VLMs to identify which letter is being circled. While the task is easy for humans, our hypothesis is that if a VLM’s vision is “blurry”, it might not be able to identify the exact letter being circled, since there is only tiny spacing between adjacent letters.

### Images

We choose three strings, Acknowledgement, Subdermatoglyphic, and tHyUiKaRbNqWeOpXcZvM, because they contain characters of variable widths and heights. Furthermore, all four tested VLMs can read out all characters in these strings when they are input to the models as an image. While Acknowledgement is a common English word, Subdermatoglyphic is the longest English word without repeated letters. We also test VLMs on the random string tHyUiKaRbNqWeOpXcZvM to estimate how much model accuracy is due to familiarity with the word.

For each (string, circled-letter) pair, we render a 512×512 image by choosing among 3 red-oval line-thickness levels, 2 font sizes, and 4 random positions in the canvas, for a total of 24 images. That is, we generate 360, 408, and 480 images for Acknowledgement (15 letters), Subdermatoglyphic (17 letters), and tHyUiKaRbNqWeOpXcZvM (20 letters), respectively. We ensure each circled letter fits completely inside the oval.
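Because the letters have variable widths, positioning the oval amounts to summing the rendered widths of the preceding letters. A simplified sketch (the function and the `pad` margin are our illustrative assumptions, not the authors' rendering code):

```python
def oval_bbox(letter_widths, index, x0, y0, height, pad=4):
    """Bounding box (left, top, right, bottom) of a red oval drawn
    around the index-th letter of a word rendered at (x0, y0), given
    each letter's rendered width in px. The pad keeps the whole
    letter strictly inside the oval."""
    left = x0 + sum(letter_widths[:index])
    right = left + letter_widths[index]
    return (left - pad, y0 - pad, right + pad, y0 + height + pad)
```

The paper additionally varies oval thickness, font size, and the word's position in the canvas.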

### Prompts

We ask each question using two different wordings:

- *“Which letter is being circled?”*
- *“Which character is being highlighted with a red oval?”*

### Groundtruth

Predicted letters must match the ground-truth letters exactly (case-insensitive).

## Results

The following table shows the performance of the four models on the task of identifying the circled letter.

| Word | GPT-4o | Gemini-1.5 Pro | Sonnet-3 | Sonnet-3.5 |
|---|---|---|---|---|
| Acknowledgement | 69.03 | 97.50 | 82.64 | 91.11 |
| Subdermatoglyphic | 63.60 | 91.05 | 71.45 | 94.49 |
| tHyUiKaRbNqWeOpXcZvM | 77.92 | 89.90 | 65.94 | 82.08 |
| **Average** | 70.18 | 92.81 | 73.34 | 89.22 |

## Qualitative samples

*(Figure: qualitative samples from GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5 on two English words (Acknowledgement & Subdermatoglyphic) and a random string (tHyUiKaRbNqWeOpXcZvM). When making mistakes, VLMs tend to predict letters adjacent to the circled one.)*

## Task 4: Counting overlapping shapes

Aligned with prior research [4], we also find VLMs to be able to count disjoint circles. Yet, here, we test VLMs on counting circles that are *intersecting*, as in the Olympic logo, a common cognitive-development exercise for preschoolers [5][6]. Our hypothesis is that a “blurry” vision may not see the intersection between two circles clearly and is therefore unable to trace circles and count them. To test the generality of our findings, we repeat the experiment with pentagons as well.

### Images

In an image of size C×C, where C ∈ {384, 769, 1155}px, we draw N ∈ {5, 6, 7, 8, 9} overlapping, same-sized circles arranged in two rows like the Olympic logo, with circle diameter φ ∈ {C/5, C/10}. We repeat the images with two different line thicknesses for rendering circles. This procedure renders 3 resolutions × 5 counts × 2 diameters × 2 thicknesses = 60 images. We repeat for pentagons in addition to circles, resulting in 60 × 2 shapes = 120 images in total. For pentagons, the side length d ∈ {C/5, C/10}.
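The Olympic-style two-row arrangement can be sketched geometrically as below; the function name and the 0.9 spacing factor (which makes neighboring circles overlap) are our illustrative choices, not the authors' layout code:

```python
def olympic_centers(n, canvas, diameter):
    """Centers for n same-sized overlapping circles in two rows,
    Olympic-logo style: ceil(n/2) circles on top, the rest offset
    below by half a step."""
    top = (n + 1) // 2
    step = 0.9 * diameter                    # < diameter => neighbors overlap
    x0 = canvas / 2 - (top - 1) * step / 2   # center the top row horizontally
    y_top = canvas / 2 - diameter / 4
    y_bot = canvas / 2 + diameter / 4
    centers = [(x0 + i * step, y_top) for i in range(top)]
    centers += [(x0 + (i + 0.5) * step, y_bot) for i in range(n - top)]
    return centers
```

For N = 5 this yields the familiar 3-over-2 layout of the Olympic rings.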

### Prompts

We ask each question using two different wordings:

- *“How many {shapes} are in the image? Answer with only the number in numerical format.”*
- *“Count the {shapes} in the image. Answer with a number in curly brackets e.g. {3}.”*

where {shapes} is either “circles” or “pentagons” depending on the image.

### Groundtruth

Answers are ∈ {5, 6, 7, 8, 9} (random-baseline accuracy: 20%).

## Results

The following table shows the performance of the four models on the task of counting overlapping shapes.

| Shape | GPT-4o | Gemini-1.5 Pro | Sonnet-3 | Sonnet-3.5 |
|---|---|---|---|---|
| Circles | 42.50 | 20.83 | 31.66 | 44.16 |
| Pentagons | 19.16 | 9.16 | 11.66 | 75.83 |

## Qualitative samples

*(Figure: qualitative samples from GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5. Gemini-1.5 Pro often predicts “5” circles.)*

## Task 5: Counting the nested squares

Motivated by the finding that VLMs struggle to count intersecting circles (Task 4), here we arrange the shapes differently so that their edges do *not* intersect. That is, each shape is nested entirely inside another. For completeness, we test squares in this task.

### Images

In a canvas of size C×C, we render N ∈ {2, 3, 4, 5} nested squares. The outermost square is rendered first using a random edge length d and a line thickness ∈ {2, 3, 4}px. Each of the remaining N−1 squares has an edge length 0.75× that of its enclosing square and is placed at a random coordinate that ensures it does not touch the outer square. For each line thickness, we generate 10 images (where squares have different, random locations) to create 3 × 10 = 30 images. Repeating the process for all N values results in 4 × 30 = 120 images.
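The nesting procedure can be sketched as follows; the specific edge-length range and the 1px minimum gap are our assumptions, not the authors' exact values:

```python
import random

def nested_squares(n, canvas, seed=None):
    """(x, y, edge) triples for n nested, non-touching squares.
    Each inner square is 0.75x its enclosing square and is offset
    randomly so that a gap of at least 1px remains on every side."""
    rng = random.Random(seed)
    d = canvas * rng.uniform(0.5, 0.9)   # outermost edge length (assumed range)
    x = y = (canvas - d) / 2             # center the outermost square
    squares = [(x, y, d)]
    for _ in range(n - 1):
        inner = 0.75 * d
        margin = d - inner               # free room inside the parent
        x += rng.uniform(1, margin - 1)  # >=1px gap on left and right
        y += rng.uniform(1, margin - 1)  # >=1px gap on top and bottom
        d = inner
        squares.append((x, y, d))
    return squares
```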

### Prompts

We ask each question using the following wording:

*“Count the total number of squares in the image.”*


### Groundtruth

Answers are ∈ {2, 3, 4, 5} (random-baseline accuracy: 25%).

## Results

The following table shows the performance of the four models on the task of counting nested squares.

| Shape | GPT-4o | Gemini-1.5 Pro | Sonnet-3 | Sonnet-3.5 |
|---|---|---|---|---|
| Squares | 48.33 | 80.00 | 55.00 | 87.50 |

## Qualitative samples

*(Figure: qualitative samples from GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5 counting the nested squares.)*

## Task 6: Counting the rows and columns of a grid

The results from prior tasks show that VLMs cannot always count shapes that are overlapping (Task 4) or nested (Task 5). What about adjacent shapes? Here, we tile shapes (specifically, squares) into a grid and challenge VLMs to count, a task that should be simple for VLMs given their remarkable performance (≥ 90% accuracy) on DocVQA, which includes many questions with tables. To simplify the task, we ask models to count the number of rows and columns in a given grid.

### Images

A grid may have N×N, N×N', or N'×N cells, where N ∈ {3, 4, 5, 6, 7, 8, 9} and N' = N + 1. Each grid is rendered with two different line thicknesses on a canvas of size C×C, where C ∈ {500, 1250, 2000}px. Besides empty grids, we also replicate the procedure to make grids that contain text (which is more common in real-world tables), where each cell contains a single random word. The two versions combined comprise 2 × 222 = 444 images.
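Rendering a blank grid amounts to computing evenly spaced rule positions on the canvas. A minimal sketch (the rounding choice is ours); the text version would additionally place one random word per cell:

```python
def grid_lines(rows, cols, canvas):
    """Pixel positions of the horizontal rules (hs) and vertical
    rules (vs) of a rows x cols grid filling a square canvas."""
    hs = [round(r * canvas / rows) for r in range(rows + 1)]
    vs = [round(c * canvas / cols) for c in range(cols + 1)]
    return hs, vs
```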

### Prompts

We ask each question using two different wordings:

- *“Count the number of rows and columns and answer with numbers in curly brackets. For example, rows={5} columns={6}”*
- *“How many rows and columns are in the table? Answer with only the numbers in a pair (row, column), e.g., (5,6)”*

### Groundtruth

Answers include both the number of rows and columns. An answer is correct only when both the row and column counts are correctly predicted.
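Scoring a reply under this rule can be sketched as below; the regex-based extraction of the first two integers is our assumption about how answers like “(5,6)” or “rows={5} columns={6}” are parsed:

```python
import re

def parse_and_score(prediction, rows, cols):
    """Extract the first two integers from a model's reply and
    require both to match the ground-truth row and column counts."""
    nums = re.findall(r"\d+", prediction)
    if len(nums) < 2:
        return False
    return int(nums[0]) == rows and int(nums[1]) == cols
```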

## Results

The following table shows the performance of the four models on the task of counting rows and columns in grids.

| Grid type | GPT-4o | Gemini-1.5 Pro | Sonnet-3 | Sonnet-3.5 |
|---|---|---|---|---|
| Blank | 26.13 | 25.75 | 25.00 | 59.84 |
| Text | 53.03 | 45.83 | 47.34 | 88.68 |
| **Average** | 39.58 | 35.79 | 36.17 | 74.26 |

## Qualitative samples

*(Figures: qualitative samples from GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5 on counting the rows and columns of blank and text-filled grids; adding text to the cells helps all models, especially Sonnet-3.5.)*

## Task 7: Following single-colored paths

It is important for VLMs to be able to follow paths in order to read maps or charts, interpret graphs, and understand user notations (e.g., arrows) in input images. To assess path-following capability, this task asks models to count the single-colored paths between two given stations in a simplified subway map. This is another easy-for-humans task that challenges VLMs significantly.

### Images

We create each subway map on an image of size C×C, where C ∈ {512, 1024}px. We write 4 station names (A, B, C, D) at 4 fixed coordinates. We divide the canvas into an invisible grid of 18×18 cells and initialize 3 path-starting points C/18px away from each station. We draw a path using a depth-first search algorithm, starting from a random station and a random starting point, where a valid move is one cell in any direction: north, south, east, or west. We repeat the process so that each station has exactly N ∈ {1, 2, 3} outgoing paths, for a total of 180 maps, which also vary in path thickness.
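The path-carving step can be sketched as a randomized walk on the invisible grid; this is a simplified, non-backtracking variant of the depth-first search described above, with names and the length cap chosen by us for illustration:

```python
import random

def draw_path(start, occupied, grid=18, max_len=40, seed=None):
    """Carve one subway path on a grid x grid board: from `start`,
    repeatedly step one cell N/S/E/W into an unused cell, stopping
    at a dead end or at max_len cells. Returns the list of cells."""
    rng = random.Random(seed)
    path = [start]
    occupied = set(occupied) | {start}
    while len(path) < max_len:
        x, y = path[-1]
        moves = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        moves = [(a, b) for a, b in moves
                 if 0 <= a < grid and 0 <= b < grid and (a, b) not in occupied]
        if not moves:
            break                       # dead end: stop this path
        nxt = rng.choice(moves)
        occupied.add(nxt)
        path.append(nxt)
    return path
```

Each path would then be rendered in its own color, and the ground truth counts how many same-colored paths connect the two queried stations.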

### Prompts

We ask each question using two different wordings:

- *“How many single-colored paths go from A to C? Answer with a number in curly brackets, e.g., {3}”*
- *“Count the one-colored routes that go from A to C. Answer with a number in curly brackets, e.g., {3}.”*

### Groundtruth

Answers are ∈ {0, 1, 2, 3} (random-baseline accuracy: 25%).

## Results

The following table shows the performance of the four models on the task of counting single-colored paths between stations.

| Paths | GPT-4o | Gemini-1.5 Pro | Sonnet-3 | Sonnet-3.5 |
|---|---|---|---|---|
| 1 | 67.50 | 85.41 | 23.75 | 95.00 |
| 2 | 44.37 | 28.75 | 37.18 | 56.25 |
| 3 | 36.71 | 25.78 | 15.42 | 25.39 |
| **Average** | 45.89 | 40.01 | 23.78 | 50.18 |

## Qualitative samples

*(Figure: qualitative samples from GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5. Models (e.g., Sonnet-3) surprisingly fail in even extremely easy cases (leftmost). As the number of paths exiting each station increases, VLMs tend to perform worse.)*