Not so relevant to the thread, but I've been uploading screenshots from Citrix GUIs and asking Qwen3-VL for the appropriate next action, e.g. a mouse click. While it knows what to click, it struggles to return accurate pixel coordinates for the click. Does anyone know a way to get accurate pixel coordinates back?
How do you prompt the model? In my experience, Qwen3-VL models have very accurate grounding capabilities (I’ve tested Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-30B-A3B-Thinking, and Qwen3-VL-235B-A22B-Thinking-FP8).
Note that the returned values are not direct pixel coordinates. Instead, they are normalized to a 0–1000 range. For example, if you ask for a bounding box, the model might output:
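The example output seems to have been cut off in this thread, but as an illustration (the bbox_2d JSON schema here is my assumption based on published Qwen-VL grounding examples, so verify against the model card), converting the normalized values back to screen pixels looks like this:

# Illustrative sketch only: the exact output schema is an assumption.
import json

raw = '{"bbox_2d": [412, 310, 588, 352], "label": "submit button"}'
box = json.loads(raw)

# The model's coordinates are normalized to 0-1000, so rescale them
# to the real screenshot resolution before clicking.
width, height = 1920, 1080
x1, y1, x2, y2 = box["bbox_2d"]
px1, py1 = x1 / 1000 * width, y1 / 1000 * height
px2, py2 = x2 / 1000 * width, y2 / 1000 * height

# Click the center of the box.
cx, cy = (px1 + px2) / 2, (py1 + py2) / 2
print(f"click at ({cx:.0f}, {cy:.0f})")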
It’s been about a year since I looked into this sort of thing, but Molmo will give you x,y coordinates. I hacked together a project around it. I also think Microsoft’s OmniParser is good at finding coordinates.
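For reference, Molmo answers pointing prompts with an XML-like tag. The percentage-based <point> format below is my best recollection and should be treated as an assumption; this is just a sketch of turning such a response into pixels:

# Sketch: parse a Molmo-style pointing response. The <point> tag with
# x/y as percentages of image size is an assumed format; verify it.
import re

response = '<point x="63.5" y="42.1" alt="submit button">submit button</point>'
match = re.search(r'<point x="([\d.]+)" y="([\d.]+)"', response)
if match:
    x_pct, y_pct = float(match.group(1)), float(match.group(2))
    width, height = 1920, 1080
    x_px = x_pct / 100 * width   # percentage -> pixels
    y_px = y_pct / 100 * height
    print(f"click at ({x_px:.0f}, {y_px:.0f})")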
It's not very accurate, but sometimes instructing the model to return pyautogui code works.
prompt: I attach a screenshot (1920x1080). Write code to click the submit button using pyautogui.
attachment: <screenshot>
reply:
import pyautogui

# Coordinates the model guessed for the submit button.
pyautogui.click(100, 200)
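One gotcha if you try this: if the screenshot gets resized before it reaches the model (many APIs downscale large images), the returned coordinates refer to the resized image, not the live screen. A minimal sketch of mapping them back, assuming you know both sizes:

import pyautogui

def click_rescaled(x, y, model_size, screen_size):
    """Map coordinates from the image the model saw to the live screen."""
    mw, mh = model_size
    sw, sh = screen_size
    pyautogui.click(x * sw / mw, y * sh / mh)

# Hypothetical values: model saw a 1280x720 downscale of a 1920x1080 screen.
click_rescaled(100, 200, model_size=(1280, 720), screen_size=(1920, 1080))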
If you want exact coordinates, you can run a keypoint network to pinpoint where the next click should land. Here's a simple example prompt that returns the keypoint location of the next button to click and visually localizes that point in the image.
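The example itself appears to have been cut off, but the idea is straightforward. Here's a minimal sketch (the prompt wording and the returned point are hypothetical) that overlays the keypoint on the screenshot as a sanity check:

# Sketch: draw a model-returned keypoint on the screenshot so you can
# visually verify the click target. Prompt and values are assumptions.
from PIL import Image, ImageDraw

prompt = "Return the (x, y) pixel location of the next button to click."
x, y = 960, 540  # hypothetical keypoint returned by the model

img = Image.open("screenshot.png")
draw = ImageDraw.Draw(img)
r = 8
draw.ellipse((x - r, y - r, x + r, y + r), outline="red", width=3)
img.save("screenshot_keypoint.png")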