These OCR improvements will almost certainly make their way into Google Books, which is great. Long term, they could make it possible to compress every non-digitized rare book into a corpus small enough to store for less than $5,000.[0] It would also be great for archive.org to move from Tesseract to this; I wonder what that would cost, both running it directly and through a paid API.
Not always: you can improve the loop by putting something real inside it, like a code execution tool, a search engine, a human, other AIs, or an API. As long as the model can make use of that external environment, the data it generates can keep improving. By the same logic, a human isolated from other humans for a long time might also end up going crazy.
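A minimal sketch of what I mean, assuming a hypothetical model.generate API and tasks that carry their own tests; the environment here is plain code execution, and only outputs it accepts get kept as new data:

    import subprocess
    import tempfile
    import textwrap

    def passes_real_check(code: str, test: str, timeout: int = 10) -> bool:
        # The pass/fail signal comes from actually running the code,
        # not from the model grading itself.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(textwrap.dedent(code) + "\n" + textwrap.dedent(test))
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False

    def collect_verified_data(model, tasks, attempts: int = 4):
        # Only candidates accepted by the external environment are kept,
        # so whatever flows back into training is grounded outside the model.
        verified = []
        for task in tasks:
            for _ in range(attempts):
                candidate = model.generate(task.prompt)   # hypothetical model API
                if passes_real_check(candidate, task.test):
                    verified.append((task.prompt, candidate))
                    break
        return verified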
A practical example: using LLMs to create deep research reports. The model pulls over 500 sources into a single analysis, and after all that compiling and contrasting it generates an article with references, like a wiki page. That text is probably superior in quality to most of its sources. It does not trust any one source completely, and it does not even pretend to present the truth; it only summarizes the distribution of information it found on the topic. Imagine scaling Wikipedia 1000x by deep-reporting every conceivable topic.
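Roughly, with search, fetch, and llm standing in for whatever real services get used, the shape of such a pipeline is something like this (a sketch of the idea, not any actual product's implementation):

    def deep_research(topic, search, fetch, llm, n_sources=500):
        # Gather sources, take per-source notes, then synthesize one
        # referenced article that contrasts them rather than trusting any one.
        urls = search(topic, limit=n_sources)          # placeholder search service
        notes = []
        for i, url in enumerate(urls):
            text = fetch(url)                          # placeholder page fetcher
            summary = llm("Summarize the key claims in:\n" + text[:4000])
            notes.append("[%d] %s (%s)" % (i, summary, url))
        prompt = ("Write a wiki-style article on '%s'. Compare and contrast the "
                  "numbered notes below, flag disagreements, and cite them as [i].\n\n"
                  % topic) + "\n".join(notes)
        return llm(prompt)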
[0] https://annas-archive.org/blog/critical-window.html