The large models that are successful now have avoided recurrent architectures. Recurrence is harder to train, but it allows open-ended inference, which would make scaling to any number of reasoning steps straightforward.
At some point, recurrent connections are going to get re-incorporated into these models.
Maybe two-stage training. First stage: learn to integrate as much information as possible, without recurrence, as is happening now. Second stage: embed that model in a larger iterative model, and train it for variable-step reasoning.
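A minimal sketch of what that second stage could look like, assuming a PyTorch-style setup: the stage-one feedforward core (a placeholder MLP here) gets wrapped in an outer loop that feeds its output back in as state for the next step, with a learned halting signal deciding how many steps to take. The module names and the halting mechanism are illustrative assumptions, not a prescribed design.

```python
import torch
import torch.nn as nn

class FeedforwardCore(nn.Module):
    """Stand-in for the stage-one, non-recurrent model (assumed pretrained)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x, state):
        # One "direct" reasoning shot: combine the input with the current state.
        return self.net(torch.cat([x, state], dim=-1))

class IterativeWrapper(nn.Module):
    """Stage two: embed the core in an outer loop with a learned halting signal."""
    def __init__(self, core, dim=64, max_steps=8):
        super().__init__()
        self.core = core
        self.halt = nn.Linear(dim, 1)  # probability of stopping after this step
        self.max_steps = max_steps
        self.dim = dim

    def forward(self, x):
        state = torch.zeros(x.shape[0], self.dim)
        states = []
        for _ in range(self.max_steps):
            state = self.core(x, state)            # one reasoning step
            states.append(state)
            p_halt = torch.sigmoid(self.halt(state))
            if not self.training and (p_halt > 0.5).all():
                break                              # variable-step inference
        return state, states

core = FeedforwardCore()
model = IterativeWrapper(core)
out, trajectory = model(torch.randn(4, 64))
print(out.shape, len(trajectory))
```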
Finally, successful iterative reasoning responses can be used as further training examples for the non-iterative module.
This would be similar to how we reason in steps at first in unfamiliar areas, but quickly learn to respond faster and more directly as we gain familiarity.
We continually fine-tune our fast mode on the successes of our own more powerful slow mode.
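A hedged sketch of that distillation loop, continuing the assumed setup above: run the iterative (slow) model, keep only the examples it gets right, and fine-tune the single-pass (fast) core to reproduce those answers directly. The is_correct check and the one-step target are placeholders for whatever success signal the task actually provides.

```python
import torch
import torch.nn.functional as F

def distill_fast_from_slow(core, slow_model, batches, is_correct, lr=1e-4):
    """Fine-tune the non-iterative core on successes of the iterative model.

    batches    -- iterable of (x, target) tensor pairs
    is_correct -- placeholder success check returning a boolean mask per example
    """
    opt = torch.optim.Adam(core.parameters(), lr=lr)
    slow_model.eval()
    for x, target in batches:
        with torch.no_grad():
            slow_out, _ = slow_model(x)          # slow, multi-step answer
        keep = is_correct(slow_out, target)      # keep only the successes
        if keep.sum() == 0:
            continue                             # nothing to learn from here
        x_ok, y_ok = x[keep], slow_out[keep]
        # Ask the fast core to reproduce the slow answer in a single step.
        zero_state = torch.zeros(x_ok.shape[0], y_ok.shape[-1])
        fast_out = core(x_ok, zero_state)
        loss = F.mse_loss(fast_out, y_ok)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Example usage with dummy data and an arbitrary success criterion.
batches = [(torch.randn(4, 64), torch.randn(4, 64))]
distill_fast_from_slow(core, model, batches,
                       is_correct=lambda pred, tgt: (pred - tgt).norm(dim=-1) < 1.0)
```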