
Interesting "ScreenSpot Pro" results:

    72.7% Gemini 3 Pro
    11.4% Gemini 2.5 Pro
    49.9% Claude Opus 4.5
    3.5% GPT-5.1

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

https://arxiv.org/abs/2504.07981

I was surprised at how poorly GPT-5 did in comparison to Opus 4.1 and Gemini 2.5 on a pretty simple OCR task a few months ago - I should run that again against the latest models and see how they do. https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...

Agreed, GPT-5 and even 5.1 are noticeably bad at OCR. OCRArena backs this up: https://www.ocrarena.ai/leaderboard (I'd personally rank 5.1 even lower than it sits there).

According to the calculator on the pricing page (it's inside a toggle at the bottom of the FAQs), GPT-5 resizes images so the shorter dimension is at most 768 px: https://openai.com/api/pricing/ That's roughly half the resolution I would normally use for OCR, so if that's happening even via the API, I guess it makes sense that it performs so poorly.
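To put a number on how much that resize throws away, here's a minimal sketch of a "shorter side capped at 768 px" scaling rule. The 768 figure is from the pricing-page calculator mentioned above; the 2560 x 1440 screenshot size is a made-up example, not anything from the API docs.

```python
# Sketch: what a "shorter side capped at 768 px" resize does to an image.
# The 768 cap comes from OpenAI's pricing-page calculator (as read above);
# the example dimensions are hypothetical.

def resize_to_short_side(width: int, height: int, cap: int = 768):
    """Scale (width, height) so the shorter side is at most `cap`,
    preserving aspect ratio. Returns the new dimensions."""
    short = min(width, height)
    if short <= cap:
        return width, height  # already small enough, no resize
    scale = cap / short
    return round(width * scale), round(height * scale)

# A 1440p screenshot loses nearly half its linear resolution:
print(resize_to_short_side(2560, 1440))  # (1365, 768)
```

Halving linear resolution means a quarter of the pixels, which is a lot to give up when the text you're OCRing is already small.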


That is... astronomically different. Is GPT-5.1 downscaling and losing critical information or something? How could it be so different?

This is my default explanation for visual impairments in LLMs: they're trying to compress the image into about 3,000 tokens, and you're going to lose a lot in the name of efficiency.

I got much better results with smallish UI elements in large screenshots on GPT by slicing the screenshot up manually and feeding the pieces one at a time. I think it does severely lossy downscaling.
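The manual-slicing idea above can be sketched as computing crop boxes that cover a large screenshot in overlapping tiles, each small enough that the model shouldn't downscale it. The tile size and overlap here are arbitrary choices of mine, not anything the API documents.

```python
# Sketch of manual screenshot slicing: compute overlapping crop boxes
# covering a large image. Tile size and overlap are hypothetical values.

def tile_boxes(width, height, tile=768, overlap=64):
    """Return (left, top, right, bottom) crop boxes covering the image,
    with adjacent tiles overlapping so UI elements aren't cut in half."""
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes

# Each box can then be cropped (e.g. with Pillow's Image.crop) and sent
# as its own image in the request.
print(len(tile_boxes(2560, 1440)))  # 8
```

The overlap matters: without it, a button or text label sitting on a tile boundary would be split across two images and likely missed in both.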

It has a rather low maximum resolution. Higher-resolution images get tiled up to a point: 512 x 512 is the max tile size, I think, and 2048 x 2048 the max canvas.
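Taking those recollected limits at face value (they're the commenter's memory, not documented fact), the arithmetic caps how many tiles one image can ever produce:

```python
# Hedged arithmetic for the limits above: a 2048 x 2048 canvas cut into
# 512 x 512 tiles. Both numbers are the commenter's recollection.
import math

def max_tiles(canvas: int = 2048, tile: int = 512) -> int:
    per_side = math.ceil(canvas / tile)  # tiles along one edge
    return per_side * per_side

print(max_tiles())  # 16
```

So anything beyond 16 tiles' worth of detail is gone before the model sees the image, which would explain why fine-grained GUI grounding is so sensitive to the source resolution.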

impressive.....most impressive

It's going to reach the low 90s very soon if trends continue.
