Opinion: a lot can change over such a span of time, and knowledge goes in and out of relevance. I think the steady trend of models shrinking in parameter count shows it's better to know how to use knowledge than to attempt to remember everything.
That said, optimising for maximal learning capability seems to be a recurring pattern in nature.
I think the non-obvious emergent effects are something to look into.
Culling bad models in favour of the winning A/B variant, plus checkpointing, is a kind of combination of the two, and it feeds the loop of models being trained on fresh snapshots of Internet data written by both humans and AI.
There's an unintended long-running training loop here, and I think it's going to get weirder as time goes on. A rough sketch of the culling-and-checkpointing part is below.
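Something like this, as a toy sketch (the scoring and training functions are hypothetical stand-ins, not any real pipeline): train a challenger, keep whichever variant scores better, checkpoint the survivor, repeat.

    import copy
    import random

    def evaluate(model):
        # Stand-in score; a real pipeline would run a held-out eval suite.
        return model["skill"] + random.gauss(0, 0.01)

    def train_variant(model):
        # Stand-in for fine-tuning the 'B' arm of the A/B test.
        variant = copy.deepcopy(model)
        variant["skill"] += random.gauss(0, 0.1)
        return variant

    champion = {"skill": 1.0}   # the current 'A' model
    checkpoints = []

    for generation in range(5):
        challenger = train_variant(champion)
        if evaluate(challenger) > evaluate(champion):
            champion = challenger                    # cull the weaker model
        checkpoints.append(copy.deepcopy(champion))  # checkpoint the survivor

    print(f"skill after 5 generations: {champion['skill']:.3f}")

Noise in the evaluator means a worse model occasionally survives a round, which is part of why the emergent effects are hard to predict.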
Take the wave of models that can drive Cursor, Windsurf, etc.: they get trained to be smarter and more efficient at that, then retrained for other purposes. Even when a model is deleted, the pattern in its data can be saved and trained into more advanced models over time.
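In miniature that's just knowledge distillation: the student never sees the teacher's weights, only a snapshot of its outputs, yet the behaviour survives deletion. Toy example (a made-up "model" and a plain least-squares fit, purely illustrative):

    def teacher(x):
        # Toy 'model' whose weights will be deleted after its outputs are logged.
        return 2.0 * x + 1.0

    # Snapshot the teacher's behaviour as plain data (think: text left on the Internet).
    dataset = [(i / 100, teacher(i / 100)) for i in range(100)]
    del teacher  # the original model is gone...

    # ...but a later model fit to the snapshot recovers the same pattern.
    w, b = 0.0, 0.0
    for _ in range(2000):
        grad_w = sum(2 * ((w * x + b) - y) * x for x, y in dataset) / len(dataset)
        grad_b = sum(2 * ((w * x + b) - y) for x, y in dataset) / len(dataset)
        w -= 0.1 * grad_w
        b -= 0.1 * grad_b

    print(f"student recovered w={w:.2f}, b={b:.2f}")  # ~2.00 and ~1.00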