
What kinds of things are you building? This is not my experience at all.

Just today I asked Claude (Opus 4.6) to build out a test harness for a new dynamic database diff tool. Everything seemed to be fine, but it built a test suite for an existing diff tool. It set everything up in the new directory, yet it was actually testing code and logic from a preexisting directory, despite the plan being correct before I told it to execute.

I started over and wrote out a few skeleton functions myself, then asked it to write tests for those covering some new functionality. The plan was to then ask it to add that functionality using the tests as guardrails.

Well the tests didn’t actually call any of the functions under test. They just directly implemented the logic I asked for in the tests.
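That failure mode is easy to make concrete. A minimal sketch (the function `normalize_email` is invented for illustration): the first "test" reimplements the logic inline and can never fail, no matter how broken the real function is, while the second actually exercises the function under test.

```python
# Hypothetical function under test (invented for illustration).
def normalize_email(addr: str) -> str:
    return addr.strip().lower()

# BAD: the "test" re-implements the logic instead of calling the function.
# It passes even if normalize_email is broken or deleted entirely.
def test_normalize_bad():
    result = "  Foo@Example.COM ".strip().lower()
    assert result == "foo@example.com"

# GOOD: the test calls the function under test, so a regression is caught.
def test_normalize_good():
    assert normalize_email("  Foo@Example.COM ") == "foo@example.com"

test_normalize_bad()
test_normalize_good()
```

The bad version looks plausible in review, which is exactly why it slips through when you skim agent-generated test suites.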

After $50 and 2 hours I finally got something working only to realize that instead of creating a new pg database to test against, it found a dev database I had lying around and started adding tables to it.
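One guard against that class of accident is to have every test create and destroy its own throwaway database rather than connecting to whatever happens to be running. A minimal sketch of the pattern, using Python's built-in sqlite3 as a stand-in for Postgres (with Postgres you would `CREATE DATABASE` under a unique name in setup and drop it in teardown):

```python
import sqlite3
import uuid

def make_test_db():
    # A fresh in-memory database per test: nothing pre-existing can leak
    # in, and nothing the test creates can pollute a dev database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT)")
    return conn

def test_insert():
    conn = make_test_db()
    item_id = str(uuid.uuid4())
    conn.execute("INSERT INTO items VALUES (?, ?)", (item_id, "widget"))
    rows = conn.execute("SELECT name FROM items").fetchall()
    assert rows == [("widget",)]
    conn.close()  # teardown: the database vanishes with the connection

test_insert()
```

Baking this into a fixture also makes it cheap to state as a hard requirement in the prompt: "tests must never connect to an existing database."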

When I managed to fix that, it decided that it needed to rebuild multiple Docker components before each test and tear them down after each one.

After about 4 hours and $75, I managed to get something working that was probably more code than I would have written in 4 hours, but I think it was probably worse than what I would have come up with on my own. And I really have no idea if it works because the day was over and I didn’t have the energy left to review it all.

We’ve recently been tasked at work with spending more money on Claude (not with being more productive; the metric is literally spending more money) and everyone is struggling to do anything like what the posts on HN say they are doing. So far no one in my org at a very large tech company has managed to do anything very impressive with Claude, other than bringing down prod 2 days ago.

Yes I’m using planning mode and clearing context and being specific with requirements and starting new sessions, and every other piece of advice I’ve read.

I’ve had much more luck using Opus 4.6 in Visual Studio to make more targeted changes, explain things, debug, and so on. Claude seems too hard to wrangle, and it isn’t good enough for you to be operating that far removed from the code.




Similar experience. I use these AI tools on a daily basis. I have tons of examples like yours. In one recent instance I explicitly told it in the prompt to not use memcpy, and it used memcpy anyway, and generated a 30-line diff after thinking for 20 minutes. In that amount of time I created a 10-line diff that didn't use memcpy.

I think it's the big investors' extremely powerful incentives manifesting in the form of internet comments. The pace of improvement peaked at GPT-4. There is value in autocomplete-as-a-service, and the "harnesses" like Codex take it a lot farther. But the people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away. This is not a hockey stick curve. It's a log curve.

Bigger context windows are a welcome addition. And stuff like JSON inputs is nice too. But these things aren't gonna like, take your SWE job, if you're any good. It's just like, a nice substitute for the Google -> Stack Overflow -> Copy/Paste workflow.


Most devs aren't very good. That's the reality, it's what we've all known for a long time. AI is trained on their code, and so these "subpar" devs are blown away when they see the AI generate boring, subpar code.

The second you throw a novel constraint into the mix things fall apart. But most devs don't even know about novel constraints let alone work with them. So they don't see these limitations.

Ask an LLM to not allocate? To not acquire locks? To ensure reentrancy safety? It'll fail - it isn't trained on how to do that. Ask it to "rank" software by some metric? It ends up just spitting out "community consensus" because domain expertise won't be highly represented in its training set.

I love having an LLM to automate the boring work, to do the "subpar" stuff, but they have routinely failed at doing anything I consider to be within my core competency. Just yesterday I used Opus 4.6 to test it out. I checked out an old version of a codebase that was built in a way that is totally inappropriate for security. I asked it to evaluate the system. It did far better than older models but it still completely failed in this task, radically underestimating the severity of its findings, and giving false justifications. Why? For the very obvious reason that it can't be trained to do that work.


The people glazing these tools can't design systems. I have this founder friend who I've known for decades, he knows how to code but he isn't really interested in it; he's more interested in the business side and mostly sees programming as a way to make money. Before ChatGPT he would raise money and hire engineers ASAP. When not a founder he would try to get into management roles etc etc. About a year ago he told me he doesn't really write code anymore, and he showed me part of his codebase for this new company he's building. To my horror I saw a 500-line bash script that he claimed he did not understand and just used prompts to edit it.

It didn't need to be a bash script. It could have been written in any scripting language. I presume it started as a bash script while he was exploring the idea, and since it was already bash he decided to just keep going with it. But it was one of those things where I realized: these autocomplete services will never stop and tell you "maybe this 500-line script should be rewritten in python". They will just continue to affirm you and pile onto the tech debt.

I used to freak out and think my days were numbered when people claimed they stopped writing code. But now I realize that they don't like writing code, don't care about getting better at it, don't know what good code looks like, and would hire an engineer if they could. With that framing, whenever I see someone say "Opus 4.6 is nuts. Everything I throw at it works. Frontend, backend, algorithms—it does not matter." I know for a fact that "everything" in that person's mind is very limited in scope.

Also, I just realized that there was an em-dash in that comment. So there's that. Wasn't even written by a person.


> I know for a fact that "everything" in that person's mind is very limited in scope.

I agree, and I think it's quite telling what people are impressed by. Someone elsewhere said that Opus 4.6 is a better programmer than they are and... I mean, I kinda believe it, but I think it says way more about them than it does about Opus.


Yep that's from the same comment I quoted. Decent chance it's not even a real person.

> people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away

Careful, or you're going to get slapped by the stupid astroturfing rule... but you're correct. There's also the sunk cost fallacy, post-purchase rationalization, choice-supportive bias; hell, look at r/MyBoyfriendIsAI... some people get very attached to these bots. They're like their work buddies or pets, so you don't even need to pay them; they'll glaze the crap out of it themselves.


Curious what language and stack. And have people at your company had marginally more success with greenfield projects like prototypes? I guess that’s what you’re describing, though it sounds like it’s a directory in a monorepo maybe?

This was in Go, but my org also uses TypeScript and Elixir.

I’ve had plenty of success with greenfield projects myself, but using the Copilot agent with Opus 4.5 and 4.6. I completely vibecoded a small game for my 4-year-old in 2 hours. It’s probably 20% of the way to being production-ready if I wanted to release it, but it works and he loves it.

And yes people have had success with very simple prototypes and demos at work.


Try https://github.com/gsd-build/get-shit-done. It's been a game changer for me.

> After about 4 hours and $75

Huh? The max plan is $200/month. How are you spending $75 in 4 hrs?


Enterprise plan. We've been instructed that our goal is to spend at least as much as our salary.

> is to spend at least as much as our salary

Reads as a very dystopian "let's see how many people we can replace"


You probably just don't have the hang of it yet. It's very good but it's not a mind reader and if you have something specific you want, it's best to just articulate that exactly as best you can ("I want a test harness for <specific_tool>, which you can find <here>"). You need to explain that you want tests that assert on observable outcomes and state, not internal structure, use real objects not mocks, property based testing for invariants, etc. It's a feedback loop between yourself and the agent that you must develop a bit before you start seeing "magic" results. A typical session for me looks like:

- I ask for something highly general and claude explores a bit and responds.

- We go back and forth a bit on precisely what I'm asking for. Maybe I correct it a few times and maybe it has a few ideas I didn't know about/think of.

- It writes some kind of plan to a markdown file. In a fresh session I tell a new instance to execute the plan.

- After it's done, I skim the broad strokes of the code and point out any code/architectural smells.

- I ask it to review its own work and then critique that review, etc. We write tests.

Perhaps that sounds like a lot but typically this process takes around 30-45 minutes of intermittent focus and the result will be several thousand lines of pretty good, working code.


I absolutely have the hang of Claude, and I still find that it makes those ridiculous mistakes, like replicating logic into a test rather than testing a function directly, talking to a local pg that was stale but running, etc. I have a ton of skills and pre-written prompts for testing practices, but over longer contexts it will forget and do these things, or get confused, etc.

You can minimize these problems with TLC but ultimately it just will keep fucking up.


My favorite is when you need to rebuild/restart outside of claude and it will "fix the bug" and argue with you about whether or not you actually rebuilt and restarted whatever it is you're working on. It would rather call you a liar than realize it didn't do anything.

This is a pretty annoying problem -- I just solve it by asking Claude to always use the right build command after each batch of modifications, etc.

"That's an old run, rebuild and the new version will work" lol

Don't know what to tell you. Sounds like you're holding it wrong. Based on the current state of things I would try to get better at holding it the right way.

I can't tell if you're joking?

With the back and forth refining I find it very useful to tell Claude to 'ask questions when uncertain' and/or to 'suggest a few options on how to solve this and let me choose / discuss'

This has made my planning / research phase so much better.


Yes pretty much my workflow. I also keep all my task.md files around as part of the repo, and they get filled up with work details as the agent closes the gates. At the end of each one I update the project memory file, this ensures I can always resume any task in a few tokens (memory file + task file == full info to work on it).

Pretty good workflow. But you need to change the order of the tests and have it write the tests first. (TDD)
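For clarity, the TDD ordering is: write a failing test against a stub first, then implement just enough to make it pass. A toy sketch (the `slugify` function and its behavior are invented for illustration):

```python
import re

# Step 1: a stub, so the test below fails at first (red).
def slugify(title: str) -> str:
    raise NotImplementedError

def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

# Step 2: implement just enough to make the test pass (green).
def slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

test_slugify()
```

The point of the ordering is that the test is demonstrably capable of failing before the implementation exists, which rules out the tests-that-test-nothing failure mode described upthread.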

I mean I’ve been using AI close to 4 years now and I’ve been using agents off and on for over a year now. What you’re describing is exactly what I’m doing.

I’m not seeing anyone at work either out of hundreds of devs who is regularly cranking out several thousand lines of pretty good working code in 30-45 minutes.

What’s an example of something you built today like this?


Fair, that's optimistic, and it depends what you're doing. Looking at a personal project, I had a PR this week at +3000/-500 that I feel quite good about; it took about two nights of roughly an hour per session to shape it into what I needed (a control plane for a polymarket trading engine). Though if I'm being fair, this was an outlier, only possible because I very carefully built the core of the engine to support this in advance. Most of the 3K LoC was "boilerplate" in the sense that I'm just manipulating existing data structures, not building entirely new abstractions. There are definitely some very hard-fought +175/-25 changes in this repo as well.

Definitely for my day job it's more like a few hundred LoC per task, and they take longer. That said, at work there are structural factors preventing larger changes: code review, needing to get design/product/coworker input for sweeping additions, etc. I fully believe it would be possible to go faster and maintain quality.


Those numbers are much more believable, but now we’re well into maybe a 2-3x speed up. I can easily write 500 LOC in an hour if I know exactly what I’m building (ignoring that LOC is a terrible metric).

But now I have to spend more time understanding what it wrote, so best case scenario we’re talking maybe a 50% speed up to a part of my job that I spent maybe 10-20% on.

Making very big assumptions that this doesn’t add long term maintenance burdens or result in a reduction of skills that makes me worse at reviewing the output, it’s cool technology.

On par with switching to a memory-managed language, or maybe going from J2EE to Ruby on Rails.


Thinking in terms of a "speed up multiplier" undersells it completely. The speed-up on a task I would never have even attempted is infinite. For my +3000 PR recently on my polymarket engine control plane, I had no idea how these types of things are typically done. It would have taken me many hours to think through an implementation and hours of research online to assemble an understanding of typical best practices. Now with AI I can dispatch many parallel agents to examine virtually all public resources for this at once.

Basically if it's been done before in a public facing way, you get a passable version of that functionality "for free". That's a huge deal.


1. You think you have something following typical best practices. You have no way to verify that without taking the time to understand the problem and solution yourself.

2. If you’d done 1, you’d have the knowledge yourself the next time the problem came up, and could either write it yourself or skip the verification step.

I’m not saying there aren’t problems out there where the problem is hard to solve but easy to verify. And for those use cases LLMs are terrific.

But many problems have the inverse property. And many problems that look like the first type are actually the second.

LLMs are also shockingly good at generating solutions that look plausible, independent of correctness or suitability, so it’s almost always harder to do the verification step than it seems.


The control plane is already operational and does what I need. Copying public designs solved a few problems I didn't even know I had (awkward command and control UX) and seems strictly superior to what I had before. I could have taken a lot longer on this - probably at least a week, to "deeply understand the problem and solution". But it's unclear what exactly that would have bought me. If I run into further issues I will just solve them at that time.

So what is the issue exactly? This pattern just seems like a looser form of using a library versus building from scratch.


For one I’d argue that you shouldn’t just use a library without understanding what it does and verifying it does what it says.

But a library has been used by multiple people who have verified that it does what it says it does as long as you pick something popular.

You have no idea what this code does. Maybe it has a huge security flaw? Or maybe it’s just riddled with bugs that you don’t know enough to expose.

Maybe it “follows best practices” that your agents uncovered or maybe it doesn’t.

If you expose customer data, or you fuck up in a way that costs customers money, the AI isn’t liable for that; you are.

Now, if this is just a toy app where no one can be harmed, then sure, who cares.



