GPT 5.1 / Codex already beats Gemini 3 on SWE Bench Verified and Terminal Bench ...

skhameneh · 2025-11-20T17:45:04 1763660704

There’s been community commentary that many of the GPT models are a tad overfitted to benchmarks. Benchmarks are not representative of end-user experience. That’s not to say benchmarks aren’t useful at all, but they are only useful as a subjective indicator.

knowriju · 2025-11-20T14:28:05 1763648885

Would it be fair to compare a general-purpose model with a model fine-tuned for coding?