
GPT 5.1 / Codex already beats Gemini 3 on SWE-bench Verified and Terminal-Bench, and this pushes the gap further. Seems like a decent improvement.


There’s been community commentary that many of the GPT models are a tad overfitted to benchmarks. Benchmarks are not representative of end-user experience. That’s not to say benchmarks aren’t useful at all, but they are only useful as a subjective indicator.


Would it be fair to compare a general-purpose model with a model fine-tuned for coding?



