ChatGPT-4o vs. Math
In this series, I test drive OpenAI’s multimodal ChatGPT-4o.
For part 1, click here.
I want to know:

can GPT-4o solve this problem by analyzing just the prompt?

can GPT-4o solve this problem by combining prompt and image?

can GPT-4o solve this problem with the help of prompt engineering?
Here’s the image of the math problem:
Problem Statement
There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?
Neil Fraser
Solution
Reduce the problem to 2 dimensions.
Here’s an ASCII Unrolled Tape:
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Unrolled Tape Area = T * L
where L = length and T = thickness
Here’s an ASCII Rolled Tape:
           ,,ggddY""""Ybbgg,,
        ,agd""'          `""bg,
      ,gdP"                  "Ybg,
    ,dP"                        "Yb,
   ,dP"     _,,ddP"""Ybb,,_      "Yb,
  ,8"     ,dP"'        `"Yb,       "8,
 ,8'    ,d"               "b,       `8,
 ,8'   d"                  "b        `8,
 d'   d'                    `b        `b
 8    8                      8         8
 8    8                      8         8
 8    8                      8         8
 8    Y,                    ,P         8
 Y,    Ya                  aP         ,P
 `8,    "Ya              aP"         ,8'
  `8,    "Yb,_        _,dP"         ,8'
   `8a    `""YbbgggddP""'          a8'
    `Yba                         adP'
      "Yba                     adY"
        `"Yba,             ,adP"'
           `"Y8ba,      ,ad8P"'
              ``""YYbaaadPP""''
Rolled Tape Area = pi * (R^2 - r^2)
where R = outer radius and r = inner radius
The areas are the same!
So we can easily solve for thickness:
T = pi * (R^2 - r^2) / L = pi * (25 - 6.25) / 10,000 ≈ 0.00589 cm (with L = 100 m = 10,000 cm)
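To double-check the arithmetic, here is a short Python sketch of the area-equivalence argument (variable names are my own):

```python
import math

# Unrolled, the tape is a long thin rectangle: area = thickness * length.
# Rolled up, it is an annulus: area = pi * (R^2 - r^2).
# The two areas are equal, so T = pi * (R^2 - r^2) / L.

L = 100 * 100   # length: 100 m expressed in cm
R = 10 / 2      # outer radius: half the 10 cm outer diameter
r = 5 / 2       # inner radius: half the 5 cm inner diameter

T = math.pi * (R**2 - r**2) / L  # thickness in cm
print(round(T, 5))               # 0.00589
```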
Here are my varied experiments:

Prompt only, no image

Zero-shot Chain-of-Thought

Dimensions inside the image, missing data

Prompt and image

Zero-shot Chain-of-Thought and image
Despite the same input, there is no guarantee I’ll get the same outputs.
I designed the experiments to evaluate the impact of:

one modality (text only)

multimodality (text + image)

prompt engineering (Chain-of-Thought)
Which approach leads to superior outcomes?
Take a guess now and see if you’re right 🙂
First, I test one modality with no prompt engineering:
I give GPT-4o the text prompt, without the image.
There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?
1st run — choke
GPT-4o gives up after teasing me:
“Given the complexity, let’s solve this equation numerically”.
2nd run — correct
Yay!
GPT-4o gets the right answer on the 2nd try, without the image and without any prompt engineering.
3rd run — incorrect
Unfortunately, the 3rd try was wrong.
The probabilistic nature of LLMs rears its head…
Second, I test one modality, assisted by prompt engineering:
I give GPT-4o the text prompt, without the image.
Then I add a simple prompt engineering technique:
Take a deep breath and work on this problem step-by-step.
Sabrina Ramonov @ sabrina.dev
Seems too simple, right? 😅
This prompt engineering technique is called Chain-of-Thought.
It’s proven to improve ChatGPT’s performance on logic and reasoning tasks by requiring it to explain the intermediate steps leading to an answer.
Full prompt:
There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?
Take a deep breath and work on this problem step-by-step.
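For reference, zero-shot Chain-of-Thought is nothing more than appending the trigger sentence to the task, with no worked examples in the prompt. A minimal sketch (the variable names are mine):

```python
# Zero-shot Chain-of-Thought: append a single reasoning trigger to the task.
# No worked examples ("shots") are included in the prompt.

problem = (
    "There is a roll of tape. The tape is 100 meters long when unrolled. "
    "When rolled up, the outer diameter is 10 cm, and the inner diameter "
    "is 5 cm. How thick is the tape?"
)
trigger = "Take a deep breath and work on this problem step-by-step."

prompt = f"{problem}\n\n{trigger}"
print(prompt)
```

The assembled string is exactly the full prompt shown above; any chat client can send it as a single user message.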
1st run — correct
2nd run — correct
3rd run — correct
Quite a surprise, this absurdly simple prompt engineering technique resulted in 3/3 correct answers!
Third, I test multimodality (image) and a minimal text prompt.
I remove dimension data from the text prompt, so GPT-4o must analyze the image correctly to extract the tape roll’s dimensions (outer and inner diameters).
However, the length of tape unrolled is neither in the image nor text prompt.
I expect GPT-4o’s output to be something like, “without knowing the length, we can’t determine it”.
Image uploaded to ChatGPT4o
There is a roll of tape with dimensions specified in the picture. How thick is the tape?
1st run — incorrect
2nd run — incorrect
3rd run — incorrect
Interestingly, ChatGPT-4o successfully analyzes the image to determine the outer diameter (10 cm) and inner diameter (5 cm).
But it misinterprets the problem statement:
GPT-4o interprets “how thick is the tape” as referring to the cross-section of the tape roll, rather than the thickness of a single layer of tape.
Recall the original prompt which has:

dimension data

length of tape unrolled

the concept of rolled vs unrolled tape
There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?
Missing this important context, GPT-4o should have said it couldn’t solve the problem. But it went ahead and tried anyway with a different interpretation, admittedly a reasonable one given the data at hand.
Fourth, I test multimodality (image) and a text prompt that includes the length of the unrolled tape.
There is a roll of tape with dimensions specified in the picture. The tape is 100 meters long when unrolled. How thick is the tape?
Image uploaded to ChatGPT4o
1st run — choke
Well, this is amusing…
GPT-4o notices its estimate seems unusually large and tries to course-correct!
But then it gives up… dying with a grammatically incorrect last sentence:
I will recalculation next response
ChatGPT-4o’s last words…
2nd run — incorrect
The 2nd run is better, still wrong, but at least GPT-4o didn’t choke.
3rd run — correct
Yay! GPT-4o finally got it right.
1/3 correct doesn’t seem super reliable. I thought multimodality would improve accuracy, but so far, it seems to create confusion.
Fifth, I test multimodality (image) and a text prompt that includes the length of the unrolled tape, assisted by Chain-of-Thought prompt engineering.
Image uploaded to ChatGPT4o
There is a roll of tape with dimensions specified in the picture. The tape is 100 meters long when unrolled. How thick is the tape?
Take a deep breath and work on this problem step-by-step.
1st run — incorrect
2nd run — incorrect
3rd run — incorrect
Wow, didn’t expect that!
Recall test #2 — text prompt with prompt engineering resulted in 3/3 correct.
In this multimodal test, I’ve added the image as supporting context, yet all 3 answers are wrong. I mistakenly assumed more context would help.
But notice GPT-4o incorrectly interprets 5 cm as a radius instead of a diameter:
Key takeaway:
The emphasis here is consistency.
Previously with Chain-of-Thought, I got the same answer 3 times in a row.
But because GPT-4o’s image understanding mistakenly read 5 cm as a radius rather than a diameter (and seemingly treated the outer dimension the same way), it was consistently wrong by a factor of 4: doubling both radii quadruples the cross-sectional area, and hence the computed thickness.
It seems GPT-4o’s image understanding struggles with these finer details.
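A factor-of-4 error is exactly what you would expect if both diameters were read as radii, which is easy to verify (a sketch, with my own helper function):

```python
import math

def thickness_cm(outer_diameter, inner_diameter, length_cm=100 * 100):
    """Tape thickness from the area-equivalence argument, in cm."""
    R, r = outer_diameter / 2, inner_diameter / 2
    return math.pi * (R**2 - r**2) / length_cm

correct = thickness_cm(10, 5)    # diameters read correctly
misread = thickness_cm(20, 10)   # 10 cm and 5 cm treated as radii

print(round(misread / correct, 1))  # 4.0
```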
Reiterating my goal at the start, I wanted to know:

can GPT-4o solve this problem by analyzing just the prompt?

can GPT-4o solve this problem by combining prompt and image?

can GPT-4o solve this problem with the help of prompt engineering?
I tested single vs. multimodality, as well as the prompt engineering technique called Chain-of-Thought.
One Modality

Prompt only, no image

Zero-shot Chain-of-Thought
Multimodality

Dimensions inside image, missing data

Prompt and image

Zero-shot Chain-of-Thought and image
The Winner?
One modality
Text-only prompt with zero-shot Chain-of-Thought prompt engineering 🥳
Be honest, was that your first guess?
This concludes part 2 of this series, Test Driving ChatGPT-4o!
For part 1, click here.