Everyone keeps saying 80/20 but that undersells what's going on. The last 20% isn't just hard. It's hard because of what happened during the first 80%.
When an agent takes a shortcut early on, the next step doesn't know it was a shortcut. It just builds on whatever it was handed. And then the step after that does the same thing. So by hour 80 you're sitting there trying to fix what looks like a UI bug and you realize the actual problem is three layers back. You're not doing the "hard 20%." You're paying interest on shortcuts you didn't even know were taken. (As I type this I'm having flashbacks to helping my kid build lego sets.)
The author figured this out by accident. He stopped prompting and opened Figma to design what he actually wanted. That's the move. He broke the chain before the next stage could build on it. The 100 hours is what it costs when you don't do that.
This is how all software projects play out. The difference is that when it's people, we call it tech debt or bad design and then start a project to refactor.
Apparently LLMs break some devs' brains, though. Because it's not one-shot perfect, they throw their hands in the air, claim AI can't ever do it, and move on, forgetting all those skills they (hopefully) spent years building to manage complex software. Of course a newbie vibe coder won't know this, but an experienced developer should.
Except when you've worked on building the software yourself instead of getting the LLM to do it, you have a loooooooot of built-up context that you can use to know why decisions were made, to debug faster, and to get things done more efficiently.
I can look at code I wrote years ago and have absolutely no memory of writing it, but I know it's my code and I know where some of the warts and traps are. I can answer questions about why things work a certain way.
With an LLM, you don't get that. You're basically starting from scratch when it comes to solving any problem or answering any question.
Nice, I've been working on the same problem from a different direction. Instead of analyzing sessions after the fact, I built a pipeline that structures them. Stages (plan, design, code, review, same as you'd have with humans) with gates in between.
The gates categorize issues into auto-fix or human-review. Auto-fix gets sent back to the coding agent, it re-reviews, and only the hard stuff makes it to me. That structure took me from about 73% first-pass acceptance to over 90%.
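For the curious, the triage step can be sketched in a few lines. The issue kinds and the auto-fixable set below are invented for illustration, not the actual tool's taxonomy:

```python
from dataclasses import dataclass

# Hypothetical issue taxonomy -- kinds and buckets are illustrative.
AUTO_FIX = "auto_fix"        # sent back to the coding agent
HUMAN_REVIEW = "human_review"  # escalated to me

@dataclass
class Issue:
    kind: str    # e.g. "lint", "missing_test", "spec_mismatch"
    detail: str

# Which issue kinds the pipeline trusts the agent to fix on its own.
AUTO_FIXABLE = {"lint", "formatting", "missing_test"}

def triage(issues):
    """Split gate findings into agent-fixable and human-review buckets."""
    buckets = {AUTO_FIX: [], HUMAN_REVIEW: []}
    for issue in issues:
        bucket = AUTO_FIX if issue.kind in AUTO_FIXABLE else HUMAN_REVIEW
        buckets[bucket].append(issue)
    return buckets

issues = [
    Issue("lint", "unused import"),
    Issue("spec_mismatch", "endpoint returns 200, spec says 201"),
]
buckets = triage(issues)
# Only the spec mismatch reaches a human; the lint issue loops back
# to the coding agent for an automatic fix pass.
```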
What I've been focused on lately is figuring out which gates actually earn their keep and which ones overlap with each other. The session-level analytics you're building would be useful on top of this, I don't have great visibility into token usage or timing per stage right now.
This is great. How are you "identifying" these stages in the session? Or is it just different slash commands / skills per stage?
If it's something generic enough, maybe we can build the analysis into it so it works for your use case. Otherwise, feel free to fork the repo and add your additional analysis. Let me know if you need help.
I use prompt templates, so in the first version of my analysis script on my own logs I looked for those. However, to make it generic, I switched to using Gemini as a classifier. That's what's in the repo.
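For anyone wanting the shape of that first version: a sketch of classifying a session turn by matching against known prompt templates. The template strings here are invented; the generic version in the repo asks an LLM to classify instead:

```python
# Hypothetical template markers -- one recognizable phrase per stage.
# In practice these would be substrings of your actual prompt templates.
STAGE_MARKERS = {
    "plan":   ["Produce an implementation plan"],
    "design": ["Write a design document"],
    "code":   ["Implement the following task"],
    "review": ["Review the following diff"],
}

def classify_turn(prompt_text):
    """Return the pipeline stage whose template marker appears in the
    prompt, or 'unknown' if none match -- the case that forced the
    switch to an LLM classifier for other people's logs."""
    for stage, markers in STAGE_MARKERS.items():
        if any(marker in prompt_text for marker in markers):
            return stage
    return "unknown"
```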
Senior review can definitely help, regardless of whether the code comes from a junior or an LLM. We've done this since the dawn of time. However, it doesn't scale, and since LLM volume far exceeds what juniors can produce, you end up overwhelming the seniors, who are normally overbooked anyway.
The other problem is that the types of errors LLMs make are different from the ones juniors make. There are huge sections of genuinely good code, so the senior gets "review fatigue": so much looks good that they just start rubber-stamping.
I use an automated pipeline to generate code (including terraform, risking infrastructure nukes), and I am the senior reviewer. But I have gates that do a whole range of checks, both deterministic and stochastic, before it ever gets to me. Easy things are pushed back to the LLM for it to autofix. I only see things where my eyes can actually make a difference.
Amazon's instinct is right (add a gate), but the implementation is wrong (make it human). Automated checks first, humans for what's left.
The disposition problem you describe maps to something I keep running into. I've been running fully autonomous software development agents in my own harness and there's real tension between "check everything" and "agent churns forever".
It's a liveness constraint: more checks mean less of the agent's output can pass. Even if the probabilistic mass of the output centers on "correct", you can still over-check and shut the pipeline down.
The thing I noticed: the errors have a pattern and you can categorize them. If you break the artifact delivery into stages, you can add gates in between to catch specific classes of errors. You keep throughput while improving quality. In the end, instead of LLMs with "personas", I structured my pipeline around the artifacts you create.
Everyone is circling around this. We are shifting to "code factories" that take user intent in at one end and crank out code at the other end. The big question: can you trust it?
We're building our tooling around it (thanks, Claude!) and seeing what works. Personally, I have my own harness and I've been focused on 1) discovering issues (in the broadest sense) and 2) categorizing the issues into "hard" and "easy" to solve inside the pipeline itself.
I found patterns in the errors the coding agents made in my harness, which I then exploited. I have an automated workflow that produces code in stages. I added structured checks to catch the "easy" problems at stage boundaries. It fixes those automatically. It escalates the "hard" problems to me.
In the end, this structure took me from ~73% first-pass to over 90%.
Yeah, this is what happens when there's nothing between "the agent decided to do this" and "it happened." The agent followed the state file logically. It wasn't wrong. It just wasn't checked.
His post-mortem is solid, but I think he's overcorrecting. If he does this as part of a CI/CD pipeline and manually reviews every time, he will pretty quickly get "verification fatigue". The vast majority of cases are fine, so he'll build the habit of automatically approving. Sure, he'll deeply review the first few, but over time he'll review less because he'll almost always find nothing. Then he'll pay less attention. This is how humans work.
He could automate the "easy" ones, though. TF plans are parseable, so maybe his time would be better spent only reviewing destructive changes. I've been running autonomous agents on production code for a while and this is the pattern that keeps working: start by reviewing everything, notice you're rubber-stamping most of it, then encode the safe cases so you only see the ones that matter.
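As a sketch of what "only review destructive changes" can look like: `terraform show -json <planfile>` emits a `resource_changes` list whose `change.actions` include `"delete"` for both destroys and replaces, so a gate can filter on that. The gate framing is mine; the JSON fields are Terraform's:

```python
import json

def destructive_changes(plan_json_text):
    """Return addresses of resources whose planned actions include a
    delete. Input is the output of `terraform show -json <planfile>`."""
    plan = json.loads(plan_json_text)
    hits = []
    for rc in plan.get("resource_changes", []):
        if "delete" in rc.get("change", {}).get("actions", []):
            hits.append(rc["address"])
    return hits

# Gate logic: auto-approve plans with no deletes, escalate the rest.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["delete", "create"]}},  # a replace
        {"address": "aws_instance.web",
         "change": {"actions": ["update"]}},            # safe, auto-approve
    ]
})
flagged = destructive_changes(sample)
```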
Or just never run agents on anything that touches production servers. That seems extremely obvious to me. He let Claude run terminal commands that touched his live servers.
That's very different than asking it for help to make a plan.
But the CEOs are saying everyone is going to be replaced by LLMs in 6 months. Surely that means they're capable of handling production environments without oversight from a professional.
They're doing about as well as professionals do without oversight on production environments. There's no lack of stories about people deleting their production environments, with data loss, either.
The fix has always been to limit what can be done directly to prod, and to put every change through both review and tests before it can touch production.
> they're doing as well as professionals do without oversight on production environments
The difference is that if a human does it, there's usually some accountability: you'll be asked how it happened and expected to learn from it. And if you do it again, your social score goes down, nobody will trust you, and you'll be considered a liability.
If a CLI tool does it, the outcome is different: you might stop using the tool, or you might blame yourself for not giving it enough context. And if it does it again, you might just shrug it off with “well, of course, it's just a tool”.
Accountability via reputation is exactly what's happening to AI providers. All these articles about Claude destroying systems make people trust Claude less, and maybe even “fire” Claude by choosing another AI provider with better safeguards or lower privileges built in.
So you're saying they need oversight... from a professional. Preferably someone with years of experience and domain expertise, who knows how to not fuck everything up?
Almost every software engineer seems to agree on that point. Not believing marketing hype is standard practice in this industry because plenty of us are inherently techno-optimists who have been burned by over-belief in the past.
Regardless it is hard to dismiss the fact AI is making it easier for randoms to develop software. And it will keep getting better the more integrated and controlled it gets.
If Hacker News is to be taken as a representative cross-section of the industry, I disagree. I've seen plenty of people on here so hyped it borders on hysteria. I work with a couple of senior devs who have gotten downright weird about it.
Maybe HN leans more toward the hobbyist and student side than it does industry professionals, I don't know, but you don't have to look far to find someone who swears up and down you can run a couple of agents in a loop and have them build multi-million-line code bases with little to no oversight.
> they're doing as well as professionals do without oversight on production environments.
That's nonsense. First, most people haven't deleted the production environment by accident. They have enough sense to recognize that as a dangerous thing and will pause to think about it. Second, the ones who do make that mistake learn and won't make it again, which is not something the clanker is capable of.
The article says that Claude did recognize the danger, and advised the developer to run a safer setup with no risk of the two websites stomping on each other's resources, but he overrode it. I've definitely seen situations in my career where a junior developer does something dangerous and destructive after a senior dev overrode guardrails meant to prevent it. (None quite this bad, but then again I've never worked on small sites.)
Are agents clever enough to seek and maybe use local privilege escalations? It seems like they should always run as their own user account with no credentials to anything, but I wonder if they will try to escape it somehow...
Yes, absolutely. I often see agents trying to 'sudo supervisorctl tail -f <program_name>', which fails because I don't give them sudo access. Then they realize they can just 'cat' the logfile itself and go ahead and do that.
Sometimes they realize their MCP doesn't have access to something, so they pull an API token for the service from the env vars on my dev laptop, or SSH into one of the deployed VMs using keys from ~/.ssh/ and grab the token there, then generate a curl command to do whatever they weren't given access to do.
Simple examples, but I've seen more complex workarounds too.
Just use a normal spare VPS, or run things in proper virtual machines, depending on what you prefer. There are some projects like exe.xyz (invites seem closed at the moment).
Sprite.dev from fly.io is another good one I heard about a while ago. I'm hearing less about it now, but it should only cost you when the resources are actually utilized, which is a pretty cool concept too.
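If you'd rather stay on your own machine, a minimal sketch of the same idea with a throwaway container (image and mount choices are illustrative): the agent gets a fresh environment with no host env vars, no ~/.ssh, and only a read-only view of the project:

```shell
# Illustrative only: disposable container, removed on exit (--rm),
# with the project mounted read-only so the agent can read code but
# can't touch the host filesystem or reach host credentials.
docker run --rm -it \
  -v "$PWD":/work:ro \
  -w /work \
  debian:stable bash
```

Anything the agent produces then has to leave through a channel you control, instead of landing directly on a live server.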
I've been running a multi-agent setup for quite a while to do software development. I set up a workflow with agents at each stage, spec->plan->design->code->review. The key thing I learned was that the arrangement of the checks between agents matters more than which model you pick for any one step. Most failures were omissions that a gate between stages catches.
I ended up building my own for this. SQLite backend, breaks work into stages with gates between them. A gate checks each handoff before the next stage starts. Does the code match the spec, did the tests pass, that kind of thing.
I've been running it with Claude Code for about four months now. What I didn't expect was how much the gates matter relative to the model itself. Most of the failures I see aren't hallucinations, they're omissions, and a structured check catches those easily.
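For a sense of scale, the core gate mechanism fits in very little code. A minimal sketch with an in-memory SQLite ledger; the schema and the example check are invented for illustration, not the actual tool:

```python
import sqlite3

# Record each stage handoff; a gate must pass before the next stage starts.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE handoffs (
        task_id  TEXT,
        stage    TEXT,      -- spec, plan, design, code, review
        artifact TEXT,
        gate_ok  INTEGER
    )""")

STAGES = ["spec", "plan", "design", "code", "review"]

def record_handoff(task_id, stage, artifact, checks):
    """Store the artifact; the gate passes only if every check passes."""
    ok = all(check(artifact) for check in checks)
    conn.execute("INSERT INTO handoffs VALUES (?, ?, ?, ?)",
                 (task_id, stage, artifact, int(ok)))
    return ok

def may_start(task_id, stage):
    """A stage may start only if the previous stage's gate passed."""
    prev = STAGES[STAGES.index(stage) - 1]
    row = conn.execute(
        "SELECT gate_ok FROM handoffs WHERE task_id=? AND stage=?",
        (task_id, prev)).fetchone()
    return bool(row and row[0])

# Example gate: a plan artifact must mention tests before design starts.
ok = record_handoff("T1", "plan", "Plan: add endpoint, write unit tests",
                    checks=[lambda a: "tests" in a])
```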
The checkpoint pattern you describe is exactly right. I've been dealing with this as well. Instead of vibe coding, it's vibe system engineering, and I don't care for it. So I thought about it and came up with a framework to describe and reason about different pipelines. I based it on the types of LLM failures I was seeing in my own pipeline (omissions, incorrect output, or output inconsistent with what already exists).
I wanted something I could use to objectively decide whether one test (or gate, as I call them) is better than another, and how they work together as a holistic system.
My personal tool encodes a workflow that has stages and gates. The gates enforce handoff. Once I did this I went from ~73% first-pass approval to over 90% just by adding structured checks at stage boundaries.
I'm old enough to remember that engineers researching distributed systems had the same challenge. Everyone was trying to build 100% reliable nodes, which is impossible. Then Lamport came along and showed you could actually achieve your goal at the protocol/system level.
What you're describing here is a workflow or pipeline, which is the analogy. As the LLMs produce artifacts, you have gates that verify the output deterministically. If the LLM breaks a rule, you either throw it out and reroll or you give it the feedback and let it revise.
I do this in my own tooling and I get great results. One thing from the data: they are often pretty crap at revising, spending ridiculous time/tokens in a revision loop. I'm trying to find the right balance of reroll/revise myself.
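One way to frame that balance is as bounded control flow: cap revision attempts, then discard and reroll, and escalate only when the whole budget is exhausted. A sketch, where the budget numbers and the agent calls are placeholders:

```python
MAX_REVISIONS = 2  # past this, revising tends to burn tokens for nothing

def produce(generate, revise, gate, max_rerolls=3):
    """Generate an artifact; revise on gate feedback a bounded number of
    times, then throw it away and reroll from scratch. `generate`,
    `revise`, and `gate` stand in for LLM calls here."""
    for _ in range(max_rerolls):
        artifact = generate()
        for attempt in range(MAX_REVISIONS + 1):
            ok, feedback = gate(artifact)
            if ok:
                return artifact
            if attempt < MAX_REVISIONS:
                artifact = revise(artifact, feedback)
        # revision budget exhausted without passing: discard and reroll
    raise RuntimeError("budget exhausted; escalate to a human")
```

Tuning `MAX_REVISIONS` down and `max_rerolls` up biases toward rerolling, which is where the data seems to point when revision loops go pathological.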