Hacker News
Attention Residuals (arxiv.org)
2 points by djhemath 3 days ago | hide | past | favorite | 1 comment



This paper by the Kimi team describes a technique that lets us add more depth to a model without losing information/context. Although it improves efficiency by just over 1%, at training scale the total savings could reach millions of dollars. Or at least, it would let us build models with more layers for the same cost as today.
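For context, the "depth without losing information" idea rests on residual connections: each layer adds its output to its input, so the original signal survives many stacked layers. Below is a generic, minimal sketch of a residual connection wrapped around a toy self-attention step. This is illustrative background only, not the paper's actual method; all names and the identity-projection attention are my own simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # toy single-head self-attention with identity Q/K/V projections
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return scores @ x

def block_with_residual(x):
    # residual: output = input + attention(input), so the information
    # in x is carried forward even through many stacked layers
    return x + attention(x)

x = np.random.randn(4, 8)   # 4 tokens, dimension 8
y = block_with_residual(x)
```

The paper presumably adds extra residual paths inside the attention computation itself (hence "attention residuals"), but the mechanism preserving context across depth is the same additive skip shown here.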


