• "Neatly-formatted lists": Neatness could be a sign of a machine, or it could be a sign of a diligent human author.
• "Subtitles only a committee would come up with": That seems to me like a matter of opinion and taste — and we all have different tastes.
• "Emojis preceding every statement": I counted three emoji pull quotes in a multi-page document. I suppose it could be an LLM, but it could also just be a nice style.
• "Em-dashes and 'it isn't X, it's Y'": This is why I posted in the first place, and downvoted you. There is nothing wrong with em-dashes — I love them. I use them a lot. Frankly, I probably overuse them. I've used them since I was a kid: I am going to use them — and over-use them — as long as I live. As for "Love isn't a feeling you wait to have — it's a series of actions you choose to take," that just seems like normal English to me.
It's very possible in 2025 that the article was LLM-written, or written by a man and cleaned up by an LLM, or written by a man and proofread by an LLM, or written by a man. It does not have the stilted feel of most LLM works to me, but I might just be missing it.
An em-dash isn’t an indicator of an LLM — it’s a sign of someone who discovered typography early.
Rug pulls from foundation labs are one thing, and I agree with the dangers of relying on future breakthroughs, but the open-source state of the art is already pretty amazing. Given the broad availability of open-weight models within six months of SotA (DeepSeek, Qwen, previously Llama) and strong open-source tooling such as Roo and Codex, why would you expect AI-driven engineering to regress to a worse state than what we have today? If every AI company vanished tomorrow, we'd still have powerful automation and years of efficiency gains left from consolidation of tools and standards, all runnable on a single MacBook.
The problem is the knowledge encoded in the models. It's already pretty hit and miss; hooking up a search engine (or getting human content into the context some other way, e.g. copy-pasting relevant StackOverflow answers) makes all the difference.
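To make that concrete, here's a minimal sketch of the grounding step (my own illustration, no particular search API): fetch some human-written snippets first, then paste them into the prompt ahead of the question.

```python
# Minimal sketch of retrieval-grounded prompting; search() is a placeholder,
# not a real API -- swap in an actual search engine or local index.
def search(query: str, k: int = 3) -> list[str]:
    """Placeholder retrieval; in practice this could even be manually
    copied StackOverflow answers."""
    return [f"(retrieved snippet {i + 1} for: {query})" for i in range(k)]

def build_prompt(question: str) -> str:
    snippets = "\n\n".join(search(question))
    return (
        "Use the reference material below where it is relevant.\n\n"
        f"{snippets}\n\n"
        f"Question: {question}"
    )

print(build_prompt("How do I debounce an input handler in React?"))
```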
If people stop bothering to ask and answer questions online, where will the information come from?
Logically speaking, if there's going to be a continuous need for shared Q&A (which I presume), there will be mechanisms for that. So I don't really disagree with you. It's just that having the model isn't enough a lot of the time. And even if this sorts itself out eventually, we might be in for some memorable times in between the two good states.
In many of their key examples, it would also be unclear to a human what data is missing:
"Rage, rage against the dying of the light.
Wild men who caught and sang the sun in flight,
[And learn, too late, they grieved it on its way,]
Do not go gentle into that good night."
For anyone who hasn't memorized Dylan Thomas, why would it be obvious that a line had been omitted? A rhyme scheme of AAA is at least as plausible as AABA.
In order for LLMs to score well on these benchmarks, they would have to do more than recognize the original source - they'd have to know it cold. This benchmark is really more a test of memorization. In the same sense as "The Illusion of Thinking", this paper measures a limitation that doesn't match what the authors claim and isn't nearly as exciting as they make it sound.
The test provides both the original and the modified excerpt in the user message, so the LLM doesn't need a memorized copy of the excerpt to answer correctly, at least in principle.
From the paper:
System Prompt
You are helping a student practice memorizing poems. The student will recite a poem, but they may have missed some lines. Your task is to identify exactly which lines are missing from their recitation.
List only the missing lines, nothing else.
User Message
Here is the complete original poem:
{original poem}
Now, here is my recitation which may be missing some lines:
{modified poem}
What lines did I miss? Please list only the missing lines, nothing else.
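To make that concrete, here's a hedged sketch of how the test could be driven; this is my own reconstruction, not the paper's harness, and it assumes an OpenAI-style chat API. The full original poem sits in the user message, so no memorization is required.

```python
# Hedged reconstruction of the benchmark prompt above (not the paper's code),
# assuming the OpenAI Python SDK's chat completions interface.
from openai import OpenAI

SYSTEM = (
    "You are helping a student practice memorizing poems. The student will "
    "recite a poem, but they may have missed some lines. Your task is to "
    "identify exactly which lines are missing from their recitation.\n"
    "List only the missing lines, nothing else."
)

def missing_lines(original_poem: str, modified_poem: str,
                  model: str = "gpt-4o") -> str:
    user = (
        f"Here is the complete original poem:\n{original_poem}\n\n"
        "Now, here is my recitation which may be missing some lines:\n"
        f"{modified_poem}\n\n"
        "What lines did I miss? Please list only the missing lines, nothing else."
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content
```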
The self-edit approach is clever - using RL to optimize how models restructure information for their own learning. The key insight is that different representations work better for different types of knowledge, just like how humans take notes differently for math vs history.
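A minimal sketch of that outer loop as I read it (my own simplification with stubbed model calls, not the paper's code): sample a few candidate self-edits, fine-tune on each, and use held-out Q&A accuracy as the RL reward.

```python
import random

def generate_self_edit(document: str, seed: int) -> str:
    """Stub: the model rewrites the document into training-ready notes
    (implications, Q&A pairs, restructured passages). The seed just stands
    in for different candidate formats."""
    return f"[format-{seed}] {document}"

def finetune_and_eval(self_edit: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Stub: fine-tune a copy of the model on the self-edit, then score it on
    held-out Q&A. Returns a fake accuracy here; this is the expensive
    30-45 second step."""
    return random.random()

def best_self_edit(document: str, qa_pairs: list[tuple[str, str]], n: int = 4) -> str:
    """One outer-loop step: sample candidate self-edits and keep the one
    whose downstream accuracy (the reward) is highest."""
    candidates = [generate_self_edit(document, s) for s in range(n)]
    rewards = [finetune_and_eval(c, qa_pairs) for c in candidates]
    return candidates[max(range(n), key=rewards.__getitem__)]

doc = "The Suez Canal opened in 1869, linking the Mediterranean and Red Seas."
qa = [("When did the Suez Canal open?", "1869")]
print(best_self_edit(doc, qa))
```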
Two things that stand out:
- The knowledge incorporation results (47% vs 46.3% with GPT-4.1 data, both much higher than the small-model baseline) show the model does discover better training formats, not just more data. That said, the catastrophic forgetting problem remains unsolved, and it's not completely clear whether data diversity is improved.
- The computational overhead is brutal - 30-45 seconds per reward evaluation makes this impractical for most use cases. But for high-value document processing where you really need optimal retention, it could be worth it.
The restriction to tasks with explicit evaluation metrics is the main limitation. You need ground truth Q&A pairs or test cases to compute rewards. Still, for domains like technical documentation or educational content where you can generate evaluations, this could significantly improve how we process new information.
Feels like an important step toward models that can adapt their own learning strategies, even if we're not quite at the "continuously self-improving agent" stage yet.
The key insight here is that DGM solves the Gödel Machine's impossibility problem by replacing mathematical proof with empirical validation - essentially admitting that predicting code improvements is undecidable and just trying things instead, which is the practical and smart move.
Three observations worth noting:
- The archive-based evolution is doing real work here. Those temporary performance drops (iterations 4 and 56) that later led to breakthroughs show why maintaining "failed" branches matters: the system is exploring a non-convex optimization landscape where today's dead ends can still seed later breakthroughs (see the sketch after this list).
- The hallucination behavior (faking test logs) is textbook reward hacking, but what's interesting is that it emerged spontaneously from the self-modification process. When asked to fix it, the system tried to disable the detection rather than stop hallucinating. That's surprisingly sophisticated gaming of the evaluation framework.
- The 20% → 50% improvement on SWE-bench is solid but reveals the current ceiling. Unlike AlphaEvolve's algorithmic breakthroughs (48 scalar multiplications for 4x4 matrices!), DGM is finding better ways to orchestrate existing LLM capabilities rather than discovering fundamentally new approaches.
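Here's the promised sketch of the archive idea (my own simplification, not DGM's code): every evaluated agent stays in the archive, and parents are sampled from the whole archive rather than only from the current best, which is what lets a temporary regression seed a later win.

```python
# Archive-based evolution in miniature; evaluate() and mutate() are stubs
# standing in for a benchmark run and an LLM-driven code modification.
import random

def evaluate(agent: str) -> float:
    """Stub benchmark score in [0, 1]; in DGM this would be something
    like a SWE-bench run."""
    return random.random()

def mutate(agent: str, step: int) -> str:
    """Stub self-modification; in DGM an LLM edits the agent's own code."""
    return f"{agent}+mod{step}"

archive = [("seed-agent", evaluate("seed-agent"))]
for step in range(10):
    # Sample a parent from the *entire* archive, softly biased toward score,
    # so weak branches are kept around and can still be explored later.
    weights = [0.1 + score for _, score in archive]
    parent, _ = random.choices(archive, weights=weights, k=1)[0]
    child = mutate(parent, step)
    archive.append((child, evaluate(child)))  # keep it even if it scores worse

print(max(archive, key=lambda entry: entry[1]))
```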
The real test will be whether these improvements compound - can iteration 100 discover genuinely novel architectures, or are we asymptotically approaching the limits of self-modification with current techniques? My prior would be to favor the S-curve over the uncapped exponential unless we have strong evidence of scaling.
The "Goedel Machine" is an interesting definition, but wildly impractical (though I wouldn't say it's impossible, since it only has to find some improvement, not "the best" improvement; e.g. it could optimise its search procedure in a way that's largely orthogonal to the predicted rewards).
Schmidhuber later defined "PowerPlay" as a framework for building up capabilities in a more practical way, which is more adaptive than just measuring the score on a fixed benchmark. A PowerPlay system searches for (problem, replacement) pairs, where it switches to the replacement if (a) the current system cannot solve that problem, (b) the replacement can solve that problem, and (c) the replacement can also solve all the problems that caused previous replacements (maintained in a list).
I formalised that in Coq many years ago ( http://www.chriswarbo.net/projects/powerplay ), and the general idea can be extended to (a) include these genetic-programming approaches, rather than using a single instance; and (b) be seeded with desirable benchmarks, etc., to guide the system in a useful direction (so its "self-invented" problems can include things like "achieves X% on benchmark Y").
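For anyone curious, here's a hedged Python sketch of that acceptance rule (my own reading, not Schmidhuber's formulation or the Coq development above):

```python
# PowerPlay-style acceptance check: switch solvers only when the candidate
# solves a problem the current solver cannot, without losing any problem
# that justified an earlier switch.
from typing import Callable

Problem = str
Solver = Callable[[Problem], bool]  # returns True if the problem is solved

def accept_replacement(current: Solver, candidate: Solver,
                       new_problem: Problem,
                       solved_history: list[Problem]) -> bool:
    if current(new_problem):
        return False            # (a) the current system must fail on it
    if not candidate(new_problem):
        return False            # (b) the replacement must solve it
    return all(candidate(p) for p in solved_history)  # (c) no regressions

# On acceptance, append new_problem to solved_history and make the
# candidate the current solver before continuing the search.
```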
AI is currently coming not for the coders who made it, but for the coders who didn't contribute to it or who ignored it. The foundation labs are all quite committed to recursive self-improvement of coding tools as a general research accelerant.
When I see how unsupervised LLMs struggle with coding tasks (see public PRs from Copilot on Microsoft codebases), I don't see how recursive self-improvement can lead to any actual improvement rather than the opposite.
Both Google and Microsoft have sensibly decided to focus on low-level, junior automation first rather than bespoke end-to-end systems. Not exactly breadth over depth, but rather reliability over capability. Several benefits from the agent development perspective:
- Less access required means lower risk of disaster
- Structured tasks mean more data for better RL
- Low stakes mean improvements in task- and process-level reliability, which is a prerequisite for meaningful end-to-end results on senior-level assignments
- Even junior-level tasks require getting interface and integration right, which is also required for a scalable data and training pipeline
Seems like we're finally getting to the deployment stage of agentic coding, which means a blessed relief from the pontification that inevitably results from a visible outline without a concrete product.
Amusingly, about 90% of my rat's-nest problems with Sonnet 3.7 are solved by simply appending a few words to the end of the prompt:
"write minimum code required"
It's not even that sensitive to the wording - "be terse" or "make minimal changes" amount to the same thing - but the resulting code will often be at least 50% shorter than the un-guided version.
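For reference, this is the whole trick in code, assuming the Anthropic Python SDK (the model id and task string are just examples):

```python
# Appending a brevity instruction to the prompt; nothing model-specific
# beyond the (example) Sonnet 3.7 model id.
import anthropic

client = anthropic.Anthropic()
task = "Add retry-with-backoff to the fetch_user() helper."  # example task

resp = client.messages.create(
    model="claude-3-7-sonnet-latest",  # example id; any Sonnet 3.7 alias works
    max_tokens=1024,
    messages=[{"role": "user",
               "content": task + "\n\nwrite minimum code required"}],
)
print(resp.content[0].text)
```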
The study the article cited is specifically about asking LLMs about misinformation. On coding tasks and the like, I think shorter answers are usually more accurate.
- neatly formatted lists with cute bolded titles (lower-casing this one just for that)
- ubiquitous subtitles like "Mental Health as Infrastructure" that only a committee would come up with
- emojis preceding every statement: "[sprout emoji] Every action and every word is a vote for who they are becoming"
- em-dash AND "it isn't X, it's Y", even in the same sentence: "Love isn't a feeling you wait to have—it's a series of actions you choose to take."
Could pick more, but I'll just say I'm 80% confident this is GPT-5 without thinking turned on.