In my view, Spec-Driven systems are doomed to fail. There's nothing that couples the English-language specs you've written with the actual code and behaviour of the system - unless your agent is being insanely diligent and constantly checking whether the entire system aligns with your specs.
This has been solved already: automated testing. Tests encode the behaviour of the system into executables which actually tell you whether your system aligns or not.
Better to encode the behaviour of your system into real, executable, scalable specs (aka automated tests), otherwise your app's behaviour is going to spiral out of control after the Nth AI generated feature.
The way to ensure this actually scales with the firepower that LLMs have for writing implementation is to make the agent follow a workflow where it knows how to test, it writes the tests first, and it verifies that the tests actually reflect the behaviour of the system with mutation testing.
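A minimal sketch of the tests-first step in that workflow, using a hypothetical `slugify` feature (names and behaviour invented for illustration): the behaviour is encoded as executable assertions before any implementation exists, so the "spec" and the code cannot silently diverge.

```python
# Step 1: encode the required behaviour as executable tests, before any code.
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  Spaced  Out  ") == "spaced-out"

# Step 2: only now write the implementation, until the tests pass.
def slugify(title: str) -> str:
    """Lower-case a title and join its words with hyphens."""
    return "-".join(title.lower().split())

test_slugify()  # passes only if the code matches the encoded behaviour
```

The point is the ordering: the test exists first and acts as the executable spec the implementation must satisfy.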
Sort of agreed. Natural language specs don't scale. They can't be used to accurately model and verify the behavior of complex systems. But they can be used as a guide to create formal language specs that can be used for that purpose. As long as the formal spec is considered to be the ground truth, I think it can scale. But yeah, that means some kind of code will be required.. :)
Things like GitHub's SpecKit seem to have a fair amount of usage.
The idea behind treating specs as code is that one can effectively rebuild in the future with newer models. Test requirements could be defined upfront in the specs too, no?
I think natural language leaves too much room for ambiguities. If you treat it as code I expect you will run into frequent bugs and unintended side effects of LLM-authored changes as your software evolves. So I'm skeptical about this approach.
A formal language helps in this regard because it makes visible the inconsistencies that are hidden in the specifications.
Coding is difficult sometimes because it turns out the problem you are trying to solve is more difficult than expected (not because it's difficult to code).
Been building for a long time, and more specifically overseeing building in detail, which transfers interestingly to overseeing LLMs.
Just like with coworkers, providing the right amount of context (not too much, or too little) for the request to succeed is critical.
I shared similar views, but I have seen first-hand (using it in production myself) that specs, done well in a way that suits LLMs, can support development with AI that works. If something doesn't work out, you don't fix the code, you adjust the spec. Highly recommend watching doers on YouTube who are sharing their screens.
Discovering that a problem is more difficult than expected allows you to take more shots at it, more quickly, by adjusting the spec, for example, and running again. We are used to just plowing ahead to make the code right, instead of improving/clarifying the ask/spec.
In my experience, when you sell expensive complex systems, customers are very worried about any differences in system behavior as a result of software updates.
When you implement a new feature with these tools, how do you convince yourself that existing system behavior remains unchanged?
When you have the code in front of you, at least you can reason about the full system behavior before and after, because code is unambiguous like that.
With spec driven development, the LLM can rewrite anything as long as it meets the spec. That's a problem if your customer relies on behavior that's written down ambiguously (or omitted entirely).
So, I think this is only going to work if you write specs with mathematical precision.. at which point you probably want to write them using a mathematical language.
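One lightweight step toward that kind of precision, short of a full mathematical language, is a property check: the spec becomes an executable predicate instead of prose. A sketch, using a hypothetical discount-capping rule (the function and its invariants are invented for illustration):

```python
def apply_discount(price: float, percent: float) -> float:
    """Apply a percentage discount, never dropping below zero."""
    return max(0.0, price * (1 - percent / 100))

# The 'spec' as executable properties rather than English sentences:
# the result is never negative and never exceeds the original price.
for price in range(0, 200, 7):
    for percent in range(0, 101, 5):
        result = apply_discount(float(price), float(percent))
        assert 0.0 <= result <= price, (price, percent, result)
```

Property-based testing libraries (e.g. Hypothesis in Python) generalize this idea by generating the inputs for you, which gets closer to the "mathematical precision" being asked for here.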
I've built, integrated and sold expensive complex systems. They want it working, connected, and reliable. Lots of paths there.
Have you built with LLMs? I'm asking because my points come from experience of having something working on a complex code base.
Specifications, or inputs in a way, are a new kind of code. The added focus on documentation, before and after, is a bonus too, and also helps with alignment.
Code styles/formats/philosophies can be documented and followed.
The human process of what to look into, in what way, and for which areas of the code base can also be trained and remembered. There are ways to achieve and maintain precision without 100% mathematical precision, because there are only so many ways to solve a problem or step, and the mechanisms for deciding can also be defined, in general or in specific terms.
I build with LLMs all the time but I generally don't do vibe coding unless it's something small I don't really care about.
When I look at SpecKit, I see a kind of vibe coding fantasy: "code is no longer king", stop writing "undifferentiated code." There is no code on the site, just a bunch of prompts and commands.
On the other hand, what you are describing above is bringing specs closer to the codebase, while not replacing the code itself. Like I said I have no problems using natural language as a guide (even as a primary guide). I also completely agree that it helps with documentation.
My main point is: if you want to maintain a complex system, you also need to have an accurate description of the system behavior in some kind of formalism.
This kind of description reflects the true system behavior better. It's more helpful when you need to predict the impact of changes and also during debugging.
Working for what? Can you show some complex systems that are built with it? Their site only mentions a kanban board app and a photo album. I can believe it works for that.
Spec Driven Development is a curious term - it suggests it is a kind of, or at least in the tradition of, Test Driven Development but it goes in the opposite direction!
1. Specs are subject to bit-rot, there's no impetus to update them as behaviour changes - unless your agent workflow explicitly enforces a thorough review and update of the specs, and unless your agent is diligent with following it. Lots of trust required on your LLM here.
2. There's no way to systematically determine if the behaviour of your system matches the specs. Imagine a reasonable sized codebase - if there's a spec document for every feature, you're looking at quite a collection of specs. How many tokens need be burnt to ensure that these specs are always up to date as new features come in and behaviour changes?
3. Specs are written in English. They're ambiguous - they can absolutely serve the planning and design phases, but this ambiguity prevents meaningful behaviour assertions about the system as it grows.
Contrast that with tests:
1. They are executable and have the precision of code. They don't just describe behaviour of the system, they validate that the system follows that behaviour, without ambiguity.
2. They scale - it's completely reasonable for extensive codebases to have most (if not all) of their behaviour covered by tests.
3. Updating is enforceable - assuming you're using a CI pipeline, when tests break, they must be updated in order to continue.
4. You can systematically determine whether the tests fully describe the behaviour (i.e. is all the behaviour tested) via mutation testing. This will tell you with certainty whether code is tested or not - that is, whether the tests fully describe the system's behaviour.
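To make point 4 concrete, here is a toy illustration of what a mutation-testing tool does: it produces a "mutant" of the code (hand-written here; real tools such as mutmut for Python or Stryker for JavaScript generate them automatically) and checks that at least one test fails against it.

```python
def in_range(x: int, low: int, high: int) -> bool:
    """True if low <= x <= high."""
    return low <= x <= high

def mutant_in_range(x: int, low: int, high: int) -> bool:
    """A typical mutant: the tool flips <= to <."""
    return low < x < high

def suite(fn) -> bool:
    """Run the test suite against an implementation; True = all tests pass."""
    return fn(5, 1, 10) and fn(1, 1, 10) and not fn(0, 1, 10)

assert suite(in_range)             # all tests pass on the real code
assert not suite(mutant_in_range)  # mutant is 'killed': a test fails
```

Note that if the suite lacked the boundary case `fn(1, 1, 10)`, the mutant would survive - which is exactly the signal that a behaviour was implemented but never tested.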
That being said, I think it's very valuable to start with a planning stage, even to provide a spec, such that the correct behaviour gets encoded into tests, and then instantiated by the implementation. But in my view, specs are best used within the design stage, and if left in the codebase, treated only as historical info for what went into the development of the feature. Attempting to use them as the source of truth for the behaviour of the system is fraught.
And I guess finally, I think that insofar as any framework uses the specs as the source of truth for behaviour, they're going to run into alignment problems since maintaining specs doesn't scale.
SDD is about flowing the design choices from the spec into the rest of the system. TDD was for making sure that the inevitable changes you make to the system later don't break your earlier assumptions - or at least warn you that you need to change them. Personally I don't buy TDD - it might be useful sometimes, but it's kind of extreme. In general, though, agile methodologies were a reaction to the waterfall model of system development.
This is just one way to use TDD. I personally get the most value from TDD as a design approach. I iteratively decompose the project into stubbed, testable components as I start the project, and implement when I have to to get my tests to pass. At each stage I'm asking myself questions like "who needs to call who? with what data? What does it expect back as a return value?" etc.
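That question-asking step can be sketched directly in code: stub the collaborator, write the test that pins down "who calls who, with what data, expecting what back", then implement. A hypothetical signup-notification example (all names invented for illustration):

```python
class StubMailer:
    """Stand-in collaborator: records calls instead of sending mail."""
    def __init__(self):
        self.sent = []

    def send(self, to: str, body: str) -> None:
        self.sent.append((to, body))

def notify_signup(mailer, email: str) -> None:
    """Component under design: the assertion below fixed its call contract."""
    mailer.send(email, "Welcome aboard!")

# The test answers "who calls who, with what data?" before real mail exists.
mailer = StubMailer()
notify_signup(mailer, "a@example.com")
assert mailer.sent == [("a@example.com", "Welcome aboard!")]
```

The stub lets the interface between components be decided and locked in by a test well before either side is fully implemented.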
> This has been solved already - automated testing.
This is specious reasoning. Automated tests are already the output of these specs, and specs cover way more than what you cover with code.
Framing tests as the feedback that drives design is also a baffling opinion. Without specialized prompts such as specs, your LLM agent of choice ends up either ignoring tests altogether or even changing them to fit its own baseless assumptions.
I mean, who hasn't stumbled upon the infamous "the rest of your tests go here" output in automated tests?
> Automated tests are already the output of these specs, and specs cover way more than what you cover with code.
OK, but how are you sure that the AI is correctly turning the spec into tests? If it makes a mistake there and then builds the code in accordance with the mistaken test, you only get the illusion of a correct implementation.
I'm sorry you feel like that. How would you phrase an observation where you find the rationale for an assertion to not be substantiated and supported beyond surface level?
You'd be surprised - I know I was - you can encode Test-Driven development into workflows that agents actually follow. I wrote an in-depth guide about this and have a POC for people to try over here: https://www.joegaebel.com/articles/principled-agentic-softwa...
I've found the best way to achieve that is to force the agent to do TDD. Better to get it to do Outside-in TDD. Even better to get it to run Outside-in TDD, then use mutation testing to ensure it has fully covered the logic.
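A sketch of what that outside-in ordering looks like, using a hypothetical checkout total (names invented for illustration): the outer, acceptance-style test exercises the public entry point first, and inner unit tests are only written for the collaborators the outer test pulls into existence.

```python
# Outer loop: acceptance-style test against the public entry point.
def test_checkout_total():
    prices = {"apple": 50, "pear": 30}
    assert checkout_total([("apple", 2), ("pear", 3)], prices) == 190

# Inner loop: unit test for the collaborator the outer test forced us to design.
def test_line_price():
    assert line_price(("apple", 2), {"apple": 50}) == 100

def line_price(item, prices):
    name, qty = item
    return prices[name] * qty

def checkout_total(items, prices):
    return sum(line_price(item, prices) for item in items)

test_line_price()
test_checkout_total()
```

Mutation testing then closes the loop by confirming the inner tests actually pin down the logic rather than just touching it.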
Actually, every single word was hand-typed. You'll probably notice that in the areas where I could improve my grammar. It's understandable that you hit the wall of text and felt a bit dismayed by the length - hence the TL;DR at the top and the example repo :)
I've been able to encode Outside-in Test Driven Development into a repeatable workflow. Claude Code follows it to a T, and I've gotten great results. I've written about it more here, and created a repo people can use out of the box to try it out:
> When asking Claude Code to write tests, I find they are inevitably coupled to implementation details, mockist, brittle, and missing coverage.
Interestingly, I haven't noticed any of that so far, using Claude Code on a new-ish project (couple 10k loc). However, I also went out of my way in my CLAUDE.md to instruct it to write functional code, avoid side effects / push side effects to the shell (functional core, imperative shell), avoid mocks in tests, etc. etc.
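For readers unfamiliar with it, "functional core, imperative shell" roughly means: keep all decisions in pure functions and push I/O to a thin outer layer, so the tests need no mocks at all. A minimal sketch, with an invented word-counting example:

```python
import sys

# Functional core: pure, deterministic, trivially testable.
def summarize(lines: list[str]) -> str:
    words = sum(len(line.split()) for line in lines)
    return f"{len(lines)} lines, {words} words"

# Imperative shell: all I/O lives here; there is nothing in it worth mocking.
def main() -> None:
    print(summarize(sys.stdin.readlines()))

# The core is tested directly, with no mocks or patched I/O.
assert summarize(["one two", "three"]) == "2 lines, 3 words"
```

Because the shell contains no logic, test brittleness and mockism largely disappear by construction, which may explain the difference in results.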
That's the kicker - you taught it conventions that it follows. Even better (in my view) is to ensure you're getting higher coverage and tests focused on behaviour by getting it to write the tests first.
Even more so by ensuring it writes "feature complete" tests for each feature first.
Even more so by running mutation testing to backfill tests for logic it didn't cover.
I've scoped this out here [1] and here [2].
[1] https://www.joegaebel.com/articles/principled-agentic-softwa... [2] https://github.com/JoeGaebel/outside-in-tdd-starter