tananaev's comments

I was very skeptical about Codex at the beginning, but now all my coding tasks start with Codex. It's not perfect at everything, but overall it's pretty amazing. Refactoring, building something new, building something I'm not familiar with. It is still not great at debugging things.

One surprising thing that Codex helped with is procrastination. I'm sure many people know the feeling: you have some big task and don't quite know where to start. Just send it to Codex. It might not get it right, but it's almost always a good starting point that you can quickly iterate on.


Infinitely agree with all of this. I was skeptical, then tried Opus 4.5 and was blown away. Codex with 5.0 and 5.1 wasn't great, but 5.2 is a big improvement. I can't code without it anymore because there's no point: on time and quality, with the right constraints, you're going to get better code.

And the same goes for procrastination, both from not knowing where to start and from getting stuck in the middle and not knowing where to go. That literally never happens anymore. You have discussions with it to do the planning and weigh different implementation options, and you get to the end with a good design description. At that point, what's the point of writing the code yourself when, with that design, it's going to write it quickly and in line with what you agreed?


You can code without it. Maybe you don't want to, but if you're a programmer, you can.

(Here I am remembering a time when I had no computer and would program data structures in OCaml with pen and paper, then go to university the next day to try them out. Oftentimes they worked on the first try.)


Sure, but the end of this post [0] is where I'm at. I don't feel the need or want to write the code when I can spend my time doing the other parts that are much more interesting and valuable.

> Emil concluded his article like this:

> JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn’t have written it this quickly without the agent.

> But “quickly” doesn’t mean “without thinking.” I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.

> That’s probably the right division of labor.

> I couldn’t agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what’s left to be a much more valuable use of my time.

[0] https://simonwillison.net/2025/Dec/14/justhtml/


But are those tests relevant? I tried using LLMs to write tests at work, and whenever I review them I end up asking it, “Ok great, it passes the test, but is the test relevant? Does it test anything useful?” And I get an “Oh yeah, you’re right, this test is pointless.”

Keep track of test coverage and ask it to delete tests without lowering coverage by more than, let's say, 0.01 percentage points. If you have a script that gives it only the test coverage, plus a file listing all tests with their line number ranges, it is more or less a dumb task it can work on for hours without actually reading the files (which would fill the context too quickly).
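
To make that concrete, here's a minimal sketch of the kind of helper script I mean, assuming pytest with pytest-cov, sources under src/ and tests under tests/ (all of those names are illustrative, not from any particular project):

  # Emit two small files the agent can work from without reading the tests:
  # the overall coverage number, and every test function with its line range.
  import ast
  import json
  import pathlib
  import subprocess

  # Run the suite once and dump a machine-readable coverage report.
  subprocess.run(
      ["pytest", "--cov=src", "--cov-report=json:coverage.json"], check=True
  )
  total = json.load(open("coverage.json"))["totals"]["percent_covered"]
  pathlib.Path("coverage_summary.txt").write_text(f"total coverage: {total:.2f}%\n")

  # Index each test with its start/end lines so deletions can be targeted
  # by line range instead of by reading file contents.
  entries = []
  for path in pathlib.Path("tests").rglob("test_*.py"):
      tree = ast.parse(path.read_text())
      for node in ast.walk(tree):
          if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
              entries.append(f"{path}:{node.lineno}-{node.end_lineno} {node.name}")
  pathlib.Path("test_index.txt").write_text("\n".join(entries) + "\n")

The agent only ever sees coverage_summary.txt and test_index.txt, proposes deletions by line range, and you re-run the script after each change to check the 0.01-point budget.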

That does not work as advertised.

If you leave an agent for hours trying to increase the coverage percentage without further guiding instructions, you will end up with lots of garbage.

In order to achieve this, you need several distinct loops. One that creates tests (there will be garbage), one that consolidates redundant tests, one that parametrizes repetitive tests, and so on.

Agents create redundant tests for all sorts of reasons. Maybe they're trying a hard-to-reach line and leave several attempts behind. Or maybe they "get creative" and try to guess what is uncovered instead of actually following the coverage report, etc.

Less capable models are actually better at doing this. They're faster, don't "get creative" with weird ideas mid-task, and cost less. Just make them work on one test at a time. Spawn, do one test that verifiably increases overall coverage, exit. Once you reach a threshold, start the consolidating loop: pick a redundant pair of tests, consolidate, exit. And so on...
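
A bare-bones driver for those two loops could look roughly like this; run_agent is a stand-in for whatever non-interactive agent CLI you use, and the prompts, threshold, and paths are made up for illustration:

  import json
  import subprocess

  def coverage_pct() -> float:
      # Re-run the suite and read the total coverage from the JSON report.
      subprocess.run(
          ["pytest", "--cov=src", "--cov-report=json:coverage.json"], check=True
      )
      return json.load(open("coverage.json"))["totals"]["percent_covered"]

  def run_agent(prompt: str) -> None:
      # Placeholder: spawn your non-interactive agent of choice with `prompt`.
      subprocess.run(["my-agent-cli", prompt], check=True)

  TARGET = 95.0  # arbitrary example threshold

  # Creation loop: one fresh agent per test, verified against coverage.
  while (before := coverage_pct()) < TARGET:
      run_agent("Add exactly one test that covers an uncovered line, then stop.")
      if coverage_pct() <= before:
          break  # no verifiable progress; stop and inspect manually

  # Consolidation loop: merge one redundant pair at a time, revert regressions.
  for _ in range(50):  # cap the number of consolidation rounds
      before = coverage_pct()
      run_agent("Merge one pair of redundant tests without losing coverage, then stop.")
      if coverage_pct() < before:
          subprocess.run(["git", "checkout", "--", "tests/"], check=True)
          break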

Of course, you can use a powerful model and babysit it as well. A few disambiguating questions and interruptions will guide it well. If you want truly unattended operation, though, it's damn hard to get stable results.


If you read my comment, I was describing the consolidation part.

We fixed this at work by instructing it to maximize coverage with minimal tests, which is closer to our coding style.

Those tests were written by people. That's why they were confident that what the LLM implemented was correct.

A meta point about how important context is.

People see LLMs and tons of tests mentioned in the same sentence and think that shows how models love writing pointless tests, rather than realizing that the tests are standard, human-written tests that show the model wrote code validated by a currently trusted source.

It shows that the importance of always writing comments that humans will read with the right context is _very_ similar to how we need to interact with LLMs. And if we fail to communicate with humans, clearly we're going to fail with models.


Yeah, we now need to specify who wrote the tests, because it's important information.

Yes

Skill issue... And perhaps the wrong model + harness


It's the semantics of "can", where it is used to suggest feasibility. When I moved and got a new commute, I still "could" bike to work, but it went from 30 minutes to an hour and a half each way. While technically possible, I would have had to sacrifice a lot by losing two hours a day: laundry, cooking dinner, downtime. I always said I "can't really" bike to work, but there is a lot of context lost.

So you can, but don't want to.


"Can" is too overloaded a word even with context provided, ranging from places like "could conceivably be achieved" to "usually possible".

The only hint you can dig out is where they might place the limits of feasibility around it. E.g. "I can fly first class all the time (if I limit the number of flights and spend an unreasonable portion of my wealth on tickets)" is typically a less useful interpretation than "I can fly first class all the time (frequently, without concern, because I'm very well off)", but you have to figure out which one they are trying to say (which isn't always easy).


I can't without seriously sacrificing productivity. (I've been coding for 30 years.)

What are you talking about? 5.2 literally just came out.

5.2-codex just came out. You could use codex with regular 5.2 for a week or so.

> It is still not great at debugging things.

It's so fascinating to me that the thread above this one on this page says the opposite, and the funniest thing is I'm sure you're both right. What a wild world we live in; I'm not sure how one is supposed to objectively analyse the performance of these things.


Give them real world problems you're encountering and see which can solve them the best, if at all

A full week of that should give you a pretty good idea

Maybe some models just suit particular styles of prompting that do or don't match what you're doing


It's great at some things, and it's awful at other things. And this varies tremendously based on context.

This is very similar to how humans behave. Most people are great at a small number of things, and there's always a larger set of things that we may individually be pretty terrible at.

The bots are the same way, except: instead of billions of people who each have their own skillsets and personalities, we've got a small handful of distinct bots from different companies.

And of course: Lies.

When we ask Bob or Lisa for help with a thing that they don't understand very well, they usually will try to set reasonable expectations. ("Sorry, ssl-3, I don't really understand ZFS very well. I can try to get the SLOG -- whatever that is -- to work better with this workload, but I can't promise anything.")

Bob or Lisa may figure it out. They'll gather up some background and work on it, bring in outside help if that's useful, and probably tread lightly. This will take time. But they probably won't deliberately lie [much] about what they expect from themselves.

But when the bot is asked to do a thing that it doesn't understand very well, it's chipper as fuck about it. ("Oh yeah! Why sure I can do that! I'm well-versed in -everything-! [Just hold my beer and watch this!]")

The bot will then set forth to do the thing. It might fuck it all up with wild abandon, but it doesn't care: It doesn't feel. It doesn't understand expectations. Or cost. Or art. Or unintended consequences.

Or, it might get it right. Sometimes, amazingly-right.

But it's impossible to tell going in whether it's going to be good, or bad: Unlike Bob or Lisa, the bot always heads into a problem as an overly-ambitious pack of lies.

(But the bot is very inexpensive to employ compared to Bob or Lisa, so we use the bot sometimes.)


I always wonder how people make qualitative statements like this. There are so many variables! Is it my prompt? The task? The specific model version? A good or bad branch out of the non-deterministic solution space?

Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results? Not snark by the way, I’m asking in earnest how you pick one model over another.


> Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results?

This is what I do. I have a little TUI that fires off Claude Code, Codex, Gemini, Qwen Coder and AMP in separate containers for most tasks I do (although I've started to use AMP less and less), and it returns the last message of each reply and/or a git diff of what exactly they did. Then I compare them side by side. If all of them got something wrong, I update the prompt and fire them off again. Always start from zero, and always include the full context of what you're doing in the first message; they're all non-interactive sessions.

Sometimes I do 3x Codex instead of different agents, just to double-check that all of them would do the same thing. If they go off and do different things from each other, I know the initial prompt isn't specific/strict enough, and I iterate again.
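
In case it's useful, the core of that fan-out is quite small. A rough sketch follows; the container image name and the agents' non-interactive invocations are placeholders, so check each CLI's docs for the real flags and pass through whatever credentials the containers need:

  import shutil
  import subprocess

  # Placeholder commands for non-interactive runs; the actual flags differ per tool.
  AGENTS = {
      "claude": ["claude", "-p"],
      "codex": ["codex", "exec"],
      "gemini": ["gemini", "-p"],
  }

  def run_agent(name: str, cmd: list[str], prompt: str, repo: str) -> str:
      workdir = f"/tmp/{name}-run"
      shutil.rmtree(workdir, ignore_errors=True)
      subprocess.run(["git", "clone", "--quiet", repo, workdir], check=True)
      # One throwaway container per agent so runs can't interfere with each other.
      subprocess.run(
          ["docker", "run", "--rm", "-v", f"{workdir}:/work", "-w", "/work",
           "agent-sandbox:latest", *cmd, prompt],
          check=True,
      )
      diff = subprocess.run(
          ["git", "-C", workdir, "diff"], capture_output=True, text=True
      )
      return diff.stdout

  prompt = "Full, self-contained task description goes here."
  for name, cmd in AGENTS.items():
      print(f"===== {name} =====")
      print(run_agent(name, cmd, prompt, repo="."))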


Please share! I'd much rather help develop your solution than vibe code one of my own ))

Honestly, I'd love to try that. My Gmail username is the same as my HN username.


Not the OP but I have https://github.com/nlothian/autocoder which supports a Github-centric workflow using the following options:

  - Claude
  - Codex
  - Kilocode
  - Amp
  - Mistral Vibe
Very vibe coded though.

What's this costing you?

So how do the models compare in your experience?

I have sent the same prompt to GPT-5.2 Thinking and Gemini 3.0 Pro many times because I subscribe to both.

GPT-5.2 Thinking (with extended thinking selected) is significantly better in my testing on software problems with 40k context.

I attribute this to thinking time: with GPT-5.2 Thinking I can coax 5+ minutes of thinking time, but Gemini 3.0 Pro only gives me about 30 seconds.

The main problem with the Plus sub in ChatGPT is you can't send more than 46k tokens in a single prompt, and attaching files doesn't help either, because the VM blocks the model from accessing the attachments if there are ~46k tokens already in the context.


Last night I gave one of the flaky tests in our test suite to three different models, using the exact same prompt.

Gemini 3 and Gemini 3 Flash identified the root cause and nailed the fix. GPT 5.1 Codex misdiagnosed the issue and attempted a weird fix despite my prompt saying “don’t write code, simply investigate.”

I run these tests regularly, and Codex has not impressed me. Not even once. At best it’s on par, but most of the time it just fails miserably.

Languages: JavaScript, Elixir, Python


The one time I was impressed with codex was when I was adding translations in a bunch of languages for a business document generation service. I used claude to do the initial work and cross checked with codex.

The Codex agent ran for a long time and created and executed a bunch of Python scripts (according to the thinking output) to compare the translations, and it found a number of possible issues. I'm not sure where the scripts were stored or executed; our project doesn't use Python.

Then I fed the output of the issues codex found to claude for a second "opinion". Claude said that the feedback was obviously from someone that knew the native language very well and agreed with all the feedback.

I was really surprised at how long Codex was thinking and analyzing - probably 10 minutes. (This was ~1+mo ago, I don't recall exactly what model)

Claude is pretty decent IMO - amp code is better, but seems to burn through money pretty quick.


I have the same experience. To make it worse, there's a mile of difference between the all-too-many versions and efforts...

This works for me in general. If I am procrastinating, I ask a coding agent for a small task. If it works, I have something to improve upon. If it doesn’t work, my OCD forces me to “fix it.” :D

Same here, actually. Though, for some reason, Codex utterly falls down with podman, especially rootless podman. No matter how many explicit instructions I give it in the prompt and AGENTS.md, it will try to set a ton of variables and break podman. It will then try to use docker (again, despite explicit instructions not to) and eventually will try to sudo podman. One time I actually let it, and it used its sudo perms to reconfigure SELinux on my system, which completely broke it so that I could no longer get root on my own machine, and the machine never booted again (because SELinux was blocking everything). It has tried to do the same thing three times now on different projects.

So yeah, I use codex a lot and like it, but it has some really bad blind spots.


> One surprising thing that codex helped with is procrastination.

Heh. It's about the same as an efficient compilation or integration-testing process that is long enough to let it do its thing while you go and browse Hacker News.

IMHO, making feedback loops faster is going to be key to improving success rates with agentic coding tools. They work best if the feedback loop is fast and thorough. So compilers, good tests, etc. are important. But it's also important that all of that runs quickly. It's almost an even split between reasoning and tool invocations for me. And it is rather trigger-happy with the tool invocations, wasting a lot of time finding out that a naive approach was indeed naive before fixing it in several iterations. Good instructions help (AGENTS.md).

Focusing attention on just making builds fast and solid is a good investment in any case. Doubly so if you plan on using agentic coding tools.


On the contrary, I will always use longer feedback cycle agents if the quality is better (including consulting 5.2 Pro as oracle or for spec work).

The key is to adapt to this by learning how to parallelize your work, instead of the old way of doing things where devs are expected to focus on and finish one task at a time (per lean manufacturing principles).

I find now that painfully slow builds are no longer a serious issue for me. Because I'm rotating through 15-20 agents across 4-6 projects so I always have something valuable to progress on. One of these projects and a few of these agents are clear priorities I return to sooner than the others.


> One surprising thing that codex helped with is procrastination.

The Roomba effect is real. The AI models do all the heavy implementation work, and when one asks me to set up and execute tests, I feel obliged to get to it ASAP.


I have similar experiences with Claude Code ;) Have you used it as well? How does it compare?

I think Opus + Claude Code is the more competent overall general "making things" system, while it makes sense to have a $20 Codex subscription to find bugs and review the things that Claude Code makes.

On its own, as sole author, I find Codex overcomplicates things. It will riddle your code with unnecessary helper functions and objects and pointless abstractions.

It is however useful for doing a once over for code review and finding the things that Claude rushed through.


Is it just a Bluetooth mic in the form of a ring? Or is there something more to this device?

Yes; it says it can store 5 minutes of audio.

We've had purpose-built machines for a while now. I think the whole point is to have an adaptable machine that can replace the remaining humans.


I read the article, but couldn't understand how they measured it.

To be fair, I think it's definitely a bubble, but it's hard to compare something like this.


It's a click-baity title, for sure.


From the official docs it sounds more like experimental support that's still under development.


I'm kind of surprised by this. Google is already under a lot of heat, especially in Europe. All sorts of lawsuits everywhere because of their monopoly abuse. And they decide to pull this move?


OTOH, it gives more options to implement Cyber Resilience Act requirements, especially once the boundaries get mapped out in real life


New EU laws are kinda requiring Google to do this


Which ones?


Chatcontrol in particular. If you control your phone it's going to be trivial to bypass it.


Chatcontrol isn't there yet.


No, of course not, but it will be.

There's another vote on the 17th of October, and most countries are in favour now :( And if it fails again, I'm sure they will keep trying, like they have been, until they can finally push it through.

Notably, in this iteration the politicians are making an exemption for themselves and their servants (including police etc.).

But I think Google thinks the time is right now because it will be a prerequisite for this.


Eagerly awaiting Apple doing the same on Macs then. Let alone any Linux distribution.


PCs aren't phones, but those might get there too some day


[flagged]


Yes and no. Yes, billionaires are not people to be trusted, but they are also a structural problem: a CEO that does not squeeze the last penny out of users for shareholder value is just not doing their job. Billionaires are a-holes because the corporate incentive system rewards people like them. We need new structures.


> a CEO that does not squeeze the last penny out of users for shareholder value is just not doing their job

This has not always been the case. And still isn't in plenty of locales and companies. The S&P 500 of 2025 doesn't define immutable universal laws.


Dodge v. Ford Motor Co. was in 1919.


I was not aware of that! Basically you can get sued if you are caught prioritizing employees or customers over shareholders?


I suspect this will penalize your site in one way or another.


Authoritarian rule can work until it doesn't. It can even work better than democracy for some time, because decisions can be made quickly. The problem is that when it stops working, there's no path for self-correction.


I’m all for democracy, but we need to find a way to make long-term commitments to nation building. If every head of state wipes out 4-5 years of the previous leadership’s work, you really can’t go far in today’s world.


That isn't a problem with every head of state; it only seems to be a problem in the US, where social cohesion and national identity no longer exist and partisan politics overrides everything. Other democracies don't have the all-powerful executive the US does, nor the innate hate and fear of effective government that the US was founded on, leading to a government designed to maximize friction and gridlock.

The US is dragged down by an archaic political system designed for a pre-industrial society of slavers that immediately devolved into a two-party binary of entrenched elites - a system the US doesn't even spread when it does nation building because it's so fundamentally broken.

So yeah, the solution here is just don't be like the US.


Where isn’t it a problem? Even here in Japan, a rising opposition party is campaigning on reversing LDP policies, and gaining support. Same everywhere in Europe.


It seems to be more of a problem in the US than elsewhere, but I admit I may be biased by living here and experiencing it firsthand. I don't see Japan tearing down its medical and research infrastructure, for instance, or their government grifting crypto, or doing half the clownshit things the US is. The UK may be getting there, I don't know.


Do we have a democracy though? If so many politicians are bought by special interests, does our system of governance allow for any path for self-correction?


As always, the truth is somewhere in the middle. AI is not going to replace everyone tomorrow, but I also don't think we can ignore productivity improvements from AI. It's not going to replace engineers completely now or in the near future, but AI will probably reduce the number of engineers needed to solve a problem.


How can you actually verify it, even if they provide something?


That's my point; you can't. They have no idea if their model came up with any of this or not.

