How do you prompt the model? In my experience, Qwen3-VL models have very accurate grounding capabilities (I’ve tested Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-30B-A3B-Thinking, and Qwen3-VL-235B-A22B-Thinking-FP8).
> But it has fundamentally no clue about the characters that make up this word (unless someone trained it to do so or by using spurious additional relations that might exist in the training data).
That was my theory as well when I first saw the strawberry test. However, it is easy to test whether they know how to spell.
The most obvious is:
> Can you spell "It is wonderful weather outside. I should go out and play."? Use capital letters, and separate each letter with a space.
The free-tier ChatGPT model is smart enough to follow the instructions below as well, which shows it can handle more than just simple words:
> I was wondering if you can spell. When I ask you a question, answer me with capital letters, and separate each letter with a space. When there is a real space between the words, insert the character '--' there, so the output is easier to read. Tell me how the attention mechanism works in modern transformer language models.
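To see why this is non-trivial in the first place, it helps to look at how the text is tokenized. Here is a minimal sketch (using the `tiktoken` library and its `cl100k_base` encoding purely as an illustration; the actual tokenizer of any given model will differ):

```python
import tiktoken

# cl100k_base is just one concrete BPE vocabulary, picked for illustration.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["wonderful", "strawberry"]:
    token_ids = enc.encode(" " + word)          # leading space mimics mid-sentence position
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {len(token_ids)} token(s): {pieces}")

# Each word comes out as only one or two multi-character tokens, so the
# model never sees the individual letters; any spelling ability has to be
# learned from the training data.
```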
Also, somebody pointed out in another HN thread that modern LLMs are perfect for dyslexic people, because you can typo every single word and the model still understands you perfectly. I'm not sure how true this is, but at least a simple example seems to work:
> Hlelo, how aer you diong. Cna you undrestnad me?
It would be interesting to know whether the datasets actually include spelling examples, or whether the models learn how to spell from the massive number of spelling mistakes in the datasets.
They can do this kind of thing, but in my experience it makes the model feel "dumber" in terms of output quality (unless you have it produce normal output first and then convert it to something else).
Mine has worked almost flawlessly since I got it two years ago. I have not had a single problem that a simple reboot has not fixed. The only bad thing about the device was the connectors where you screw on the antennas: in my router they were somehow loose, and I had to open the case to tighten them. Otherwise I could not be happier that I bought it.
The Turris team actually started another kickstarter project last summer:
https://mox.turris.cz/en/overview/
I instantly backed that project too. Not because I needed a new router, but because I have a soft spot for any Open Source / Open Hardware projects and I wanted to support them.
I just received my Turris Omnia yesterday, and I have to say I'm very pleased with it. The build quality is good and it looks very nice in person. The web UI is snappy and everything was a joy to set up. The OS is based on OpenWRT, so all the guides and documentation for OpenWRT apply to the Turris as well.
I previously used the Netgear R7000 "Nighthawk" with DD-WRT, and it worked just fine, but the lack of updates and up-to-date documentation was very off-putting. Somehow the whole setup felt like a massive hack.
I'm very surprised that there isn't more talk about the Turris Omnia here on HN. It's Open Hardware running Open Software. One would think that people here would be very interested in that kind of product.
Note that the returned values are not direct pixel coordinates. Instead, they are normalized to a 0–1000 range. For example, if you ask for a bounding box, the model might output:
```json
[
  {"bbox_2d": [217, 112, 920, 956], "label": "cat"}
]
```
Here, the values represent [x_min, y_min, x_max, y_max]. To convert these to pixel coordinates, use:
[x_min / 1000 * image_width, y_min / 1000 * image_height, x_max / 1000 * image_width, y_max / 1000 * image_height]
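As a concrete example, here is a small Python sketch of that conversion (the helper name and the 1920x1080 image size are just illustrative assumptions on my part):

```python
def denormalize_bbox(bbox_2d, image_width, image_height):
    """Map a bbox from the 0-1000 normalized range back to pixel coordinates."""
    x_min, y_min, x_max, y_max = bbox_2d
    return [
        x_min / 1000 * image_width,
        y_min / 1000 * image_height,
        x_max / 1000 * image_width,
        y_max / 1000 * image_height,
    ]

# Using the example output above, on a hypothetical 1920x1080 image:
bbox = denormalize_bbox([217, 112, 920, 956], 1920, 1080)
print([round(v, 2) for v in bbox])   # approximately [416.64, 120.96, 1766.4, 1032.48]
```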
Also, if you’re running the model with vLLM > 0.11.0, you might be hitting this bug: https://github.com/vllm-project/vllm/issues/29595