Late, but having read all of the replies, and speaking from my own observations using Claude, Codex, as well as (non-CLI) Gemini, Kimi, Qwen, and Deepseek...
It's fun how quick we are to assign meaning to the way these models act. That behaviour is of course shaped by training, RLHF, the available tool calls, the system prompt (all mostly invisible), and the way we prompt them.
I've been wondering about a new kind of benchmark: how one could extract these more intangible tendencies from models, rather than measuring them in well-controlled "how good at coding is it" style environments. This is the main reason I pay less and less attention to benchmark scores.
For what it's worth: I still converse best with Claude when doing code. Its reasoning sounds like me, and it finds a good middle ground between conservative and crazy, being explorative and daring (even though it too often exclaims "I see the issue now!"). If Anthropic lifted the usage limits I would use it as my primary. The CLI tool is also better. E.g. Codex with 5.1 gets stuck in PowerShell scripts, whilst Claude realizes it can use Python to do the heavy lifting, though I think that might be largely because I'm mainly on Windows (still, Claude works best, quickly realizing what environment it lives in rather than trying Unix commands or PowerShell invocations that don't work because my PowerShell is outdated).
Qwen is great in an IDE for quick auto-complete tasks, especially given that you can run it locally, but even the VSCode Copilot is good enough for that. Kimi is promising for long-running agentic tasks, but that's something I've barely explored and only just started playing with. Gemini is fantastic as a research assistant. Gemini 3 Pro in particular uses clear, to-the-point jargon without fearing the user is stupid, something the other commercial models are too often hesitant to do.
Again, it would be fun to have some unbiased method to uncover some of those underlying personas.