• "Neatly-formatted lists": Neatness could be a sign of a machine, or it could be a sign of a diligent human author.
• "Subtitles only a committee would come up with": That seems to me like a matter of opinion and taste — and we all have different tastes.
• "Emojis preceding every statement": I counted three emoji pull quotes in a multi-page document. I suppose it could be an LLM, but it could also just be a nice style.
• "Em-dashes and 'it isn't X, it's Y'": This is why I posted in the first place, and downvoted you. There is nothing wrong with em-dashes — I love them. I use them a lot. Frankly, I probably overuse them. I've used them since I was a kid: I am going to use them — and over-use them — as long as I live. As for "Love isn't a feeling you wait to have — it's a series of actions you choose to take," that just seems like normal English to me.
It's very possible in 2025 that the article was LLM-written, or written by a man and cleaned up by an LLM, or written by a man and proofread by an LLM, or written by a man. It does not have the stilted feel of most LLM works to me, but I might just be missing it.
An em-dash isn’t an indicator of an LLM — it’s a sign of someone who discovered typography early.
Rug pulls from foundation labs are one thing, and I agree with the dangers of relying on future breakthroughs, but the open-source state of the art is already pretty amazing. Given the broad availability of open-weight models within six months of SotA (DeepSeek, Qwen, previously Llama) and strong open-source tooling such as Roo and Codex, why would you expect AI-driven engineering to regress to a worse state than what we have today? If every AI company vanished tomorrow, we'd still have powerful automation and years of efficiency gains left from consolidation of tools and standards, all runnable on a single MacBook.
The problem is the knowledge encoded in the models. It's already pretty hit and miss; hooking up a search engine (or getting human content into the context some other way, e.g. copy-pasting relevant StackOverflow answers) makes all the difference.
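To make that concrete, here's a minimal sketch of the grounding step (my own illustration, no particular search API): fetch some human-written snippets first, then paste them into the prompt ahead of the question.

```python
# Minimal sketch of retrieval-grounded prompting; search() is a placeholder,
# not a real API -- swap in an actual search engine or local index.
def search(query: str, k: int = 3) -> list[str]:
    """Placeholder retrieval; in practice this could even be manually
    copied StackOverflow answers."""
    return [f"(retrieved snippet {i + 1} for: {query})" for i in range(k)]

def build_prompt(question: str) -> str:
    snippets = "\n\n".join(search(question))
    return (
        "Use the reference material below where it is relevant.\n\n"
        f"{snippets}\n\n"
        f"Question: {question}"
    )

print(build_prompt("How do I debounce an input handler in React?"))
```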
If people stop bothering to ask and answer questions online, where will the information come from?
Logically speaking, if there's going to be a continuous need for shared Q&A (which I presume), there will be mechanisms for that. So I don't really disagree with you. It's just that having the model isn't enough a lot of the time. And even if this sorts itself out eventually, we might be in for some memorable times in between the two good states.
In many of their key examples, it would also be unclear to a human what data is missing:
"Rage, rage against the dying of the light.
Wild men who caught and sang the sun in flight,
[And learn, too late, they grieved it on its way,]
Do not go gentle into that good night."
For anyone who hasn't memorized Dylan Thomas, why would it be obvious that a line had been omitted? A rhyme scheme of AAA is at least as plausible as AABA.
In order for LLMs to score well on these benchmarks, they would have to do more than recognize the original source - they'd have to know it cold. This benchmark is really more a test of memorization. In the same sense as "The Illusion of Thinking", this paper measures a limitation that doesn't match what the authors claim and isn't nearly as exciting as they make it sound.
The test provides both the original and the modified excerpt in the user message, so the LLM doesn't need a memorized copy of the excerpt to answer correctly, at least in principle.
From the paper:
System Prompt
You are helping a student practice memorizing poems. The student will recite a poem, but they may have missed some lines. Your task is to identify exactly which lines are missing from their recitation.
List only the missing lines, nothing else.
User Message
Here is the complete original poem:
{original poem}
Now, here is my recitation which may be missing some lines:
{modified poem}
What lines did I miss? Please list only the missing lines, nothing else.
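To make that concrete, here's a hedged sketch of how the test could be driven; this is my own reconstruction, not the paper's harness, and it assumes an OpenAI-style chat API. The full original poem sits in the user message, so no memorization is required.

```python
# Hedged reconstruction of the benchmark prompt above (not the paper's code),
# assuming the OpenAI Python SDK's chat completions interface.
from openai import OpenAI

SYSTEM = (
    "You are helping a student practice memorizing poems. The student will "
    "recite a poem, but they may have missed some lines. Your task is to "
    "identify exactly which lines are missing from their recitation.\n"
    "List only the missing lines, nothing else."
)

def missing_lines(original_poem: str, modified_poem: str,
                  model: str = "gpt-4o") -> str:
    user = (
        f"Here is the complete original poem:\n{original_poem}\n\n"
        "Now, here is my recitation which may be missing some lines:\n"
        f"{modified_poem}\n\n"
        "What lines did I miss? Please list only the missing lines, nothing else."
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content
```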
The self-edit approach is clever - using RL to optimize how models restructure information for their own learning. The key insight is that different representations work better for different types of knowledge, just like how humans take notes differently for math vs history.
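A minimal sketch of that outer loop as I read it (my own simplification with stubbed model calls, not the paper's code): sample a few candidate self-edits, fine-tune on each, and use held-out Q&A accuracy as the RL reward.

```python
import random

def generate_self_edit(document: str, seed: int) -> str:
    """Stub: the model rewrites the document into training-ready notes
    (implications, Q&A pairs, restructured passages). The seed just stands
    in for different candidate formats."""
    return f"[format-{seed}] {document}"

def finetune_and_eval(self_edit: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Stub: fine-tune a copy of the model on the self-edit, then score it on
    held-out Q&A. Returns a fake accuracy here; this is the expensive
    30-45 second step."""
    return random.random()

def best_self_edit(document: str, qa_pairs: list[tuple[str, str]], n: int = 4) -> str:
    """One outer-loop step: sample candidate self-edits and keep the one
    whose downstream accuracy (the reward) is highest."""
    candidates = [generate_self_edit(document, s) for s in range(n)]
    rewards = [finetune_and_eval(c, qa_pairs) for c in candidates]
    return candidates[max(range(n), key=rewards.__getitem__)]

doc = "The Suez Canal opened in 1869, linking the Mediterranean and Red Seas."
qa = [("When did the Suez Canal open?", "1869")]
print(best_self_edit(doc, qa))
```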
Two things that stand out:
- The knowledge incorporation results (47% vs 46.3% with GPT-4.1 data, both much higher than the small-model baseline) show the model does discover better training formats, not just more data. That said, the catastrophic forgetting problem remains unsolved, and it's not completely clear whether data diversity is improved.
- The computational overhead is brutal - 30-45 seconds per reward evaluation makes this impractical for most use cases. But for high-value document processing where you really need optimal retention, it could be worth it.
The restriction to tasks with explicit evaluation metrics is the main limitation. You need ground truth Q&A pairs or test cases to compute rewards. Still, for domains like technical documentation or educational content where you can generate evaluations, this could significantly improve how we process new information.
Feels like an important step toward models that can adapt their own learning strategies, even if we're not quite at the "continuously self-improving agent" stage yet.
The key insight here is that DGM solves the Gödel Machine's impossibility problem by replacing mathematical proof with empirical validation - essentially admitting that predicting code improvements is undecidable and just trying things instead, which is the practical and smart move.
Three observations worth noting:
- The archive-based evolution is doing real work here. Those temporary performance drops (iterations 4 and 56) that later led to breakthroughs show why maintaining "failed" branches matters: the system is exploring a non-convex optimization landscape where today's dead ends can still seed later breakthroughs (see the sketch after this list).
- The hallucination behavior (faking test logs) is textbook reward hacking, but what's interesting is that it emerged spontaneously from the self-modification process. When asked to fix it, the system tried to disable the detection rather than stop hallucinating. That's surprisingly sophisticated gaming of the evaluation framework.
- The 20% → 50% improvement on SWE-bench is solid but reveals the current ceiling. Unlike AlphaEvolve's algorithmic breakthroughs (48 scalar multiplications for 4x4 matrices!), DGM is finding better ways to orchestrate existing LLM capabilities rather than discovering fundamentally new approaches.
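Here's the promised sketch of the archive idea (my own simplification, not DGM's code): every evaluated agent stays in the archive, and parents are sampled from the whole archive rather than only from the current best, which is what lets a temporary regression seed a later win.

```python
# Archive-based evolution in miniature; evaluate() and mutate() are stubs
# standing in for a benchmark run and an LLM-driven code modification.
import random

def evaluate(agent: str) -> float:
    """Stub benchmark score in [0, 1]; in DGM this would be something
    like a SWE-bench run."""
    return random.random()

def mutate(agent: str, step: int) -> str:
    """Stub self-modification; in DGM an LLM edits the agent's own code."""
    return f"{agent}+mod{step}"

archive = [("seed-agent", evaluate("seed-agent"))]
for step in range(10):
    # Sample a parent from the *entire* archive, softly biased toward score,
    # so weak branches are kept around and can still be explored later.
    weights = [0.1 + score for _, score in archive]
    parent, _ = random.choices(archive, weights=weights, k=1)[0]
    child = mutate(parent, step)
    archive.append((child, evaluate(child)))  # keep it even if it scores worse

print(max(archive, key=lambda entry: entry[1]))
```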
The real test will be whether these improvements compound - can iteration 100 discover genuinely novel architectures, or are we asymptotically approaching the limits of self-modification with current techniques? My prior would be to favor the S-curve over the uncapped exponential unless we have strong evidence of scaling.
The "Goedel Machine" is an interesting definition, but wildly impractical (though I wouldn't say it's impossible, since it only has to find some improvement, not "the best" improvement; e.g. it could optimise its search procedure in a way that's largely orthogonal to the predicted rewards).
Schmidhuber later defined "PowerPlay" as a framework for building up capabilities in a more practical way, which is more adaptive than just measuring the score on a fixed benchmark. A PowerPlay system searches for (problem, replacement) pairs, where it switches to the replacement if (a) the current system cannot solve that problem, (b) the replacement can solve that problem, and (c) the replacement can also solve all the problems that caused previous replacements (maintained in a list).
I formalised that in Coq many years ago ( http://www.chriswarbo.net/projects/powerplay ), and the general idea can be extended to (a) include these genetic-programming approaches, rather than using a single instance; and (b) be seeded with desirable benchmarks, etc., to guide the system in a useful direction (so its "self-invented" problems can include things like "achieves X% on benchmark Y").
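For anyone curious, here's a hedged Python sketch of that acceptance rule (my own reading, not Schmidhuber's formulation or the Coq development above):

```python
# PowerPlay-style acceptance check: switch solvers only when the candidate
# solves a problem the current solver cannot, without losing any problem
# that justified an earlier switch.
from typing import Callable

Problem = str
Solver = Callable[[Problem], bool]  # returns True if the problem is solved

def accept_replacement(current: Solver, candidate: Solver,
                       new_problem: Problem,
                       solved_history: list[Problem]) -> bool:
    if current(new_problem):
        return False            # (a) the current system must fail on it
    if not candidate(new_problem):
        return False            # (b) the replacement must solve it
    return all(candidate(p) for p in solved_history)  # (c) no regressions

# On acceptance, append new_problem to solved_history and make the
# candidate the current solver before continuing the search.
```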
AI is currently coming not for the coders who made it, but for the coders who didn't contribute to it or who ignored it. The foundation labs are all quite committed to recursive self-improvement of coding tools as a general research accelerant.
When I see how unsupervised LLMs struggle with coding tasks (see public PRs from Copilot on Microsoft codebases), I don't see how recursive self-improvement can lead to any actual improvement rather than the opposite.
Both Google and Microsoft have sensibly decided to focus on low-level, junior automation first rather than bespoke end-to-end systems. Not exactly breadth over depth, but rather reliability over capability. Several benefits from the agent development perspective:
- Less access required means lower risk of disaster
- Structured tasks mean more data for better RL
- Low stakes mean improvements in task- and process-level reliability, which is a prerequisite for meaningful end-to-end results on senior-level assignments
- Even junior-level tasks require getting interface and integration right, which is also required for a scalable data and training pipeline
Seems like we're finally getting to the deployment stage of agentic coding, which means a blessed relief from the pontification that inevitably results from a visible outline without a concrete product.
Amusingly, about 90% of my rat's-nest problems with Sonnet 3.7 are solved by simply appending a few words to the end of the prompt:
"write minimum code required"
It's not even that sensitive to the wording - "be terse" or "make minimal changes" amount to the same thing - but the resulting code will often be at least 50% shorter than the un-guided version.
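For reference, this is the whole trick in code, assuming the Anthropic Python SDK (the model id and task string are just examples):

```python
# Appending a brevity instruction to the prompt; nothing model-specific
# beyond the (example) Sonnet 3.7 model id.
import anthropic

client = anthropic.Anthropic()
task = "Add retry-with-backoff to the fetch_user() helper."  # example task

resp = client.messages.create(
    model="claude-3-7-sonnet-latest",  # example id; any Sonnet 3.7 alias works
    max_tokens=1024,
    messages=[{"role": "user",
               "content": task + "\n\nwrite minimum code required"}],
)
print(resp.content[0].text)
```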
The study the article cited is specifically about asking LLMs about misinformation. On coding tasks and the like, I think shorter answers are usually more accurate.
- neatly formatted lists with cute bolded titles (lower-casing this one just for that)
- ubiquitous subtitles like "Mental Health as Infrastructure" that only a committee would come up with
- emojis preceding every statement: "[sprout emoji] Every action and every word is a vote for who they are becoming"
- em-dash AND "it isn't X, it's Y", even in the same sentence: "Love isn't a feeling you wait to have—it's a series of actions you choose to take."
Could pick more, but I'll just say I'm 80% confident this is GPT-5 without thinking turned on.