> As the context fills up, the LLM places less and less attention on things further back in the context, which is why the LLM seems dumber and dumber as a conversation goes on.
This is not entirely true. Models pay the most attention to the earliest part of the context and the most recent part, with the dip in the middle between the two (the "lost in the middle" effect). In practice that means the system prompt, which always sits at the top, keeps getting attention. Or, perhaps more accurately, models are trained to follow the system prompt, which comes first, so that's what they do.
Larger contexts are inherently more attention-taxing: the more you throw into the context, the higher the probability that any particular piece gets ignored. But that probability still varies by position, lower at the beginning, higher in the middle, and lower again at the end.
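If you want to see this for yourself, a minimal needle-in-a-haystack probe is sketched below. It buries a fact at different relative depths in a long prompt and checks whether the model can still recall it. `query_model` is a placeholder for whatever chat API or local model you actually call, not a real library function, and the filler text, needle, and depths are just illustrative choices.

```python
# Minimal sketch of a positional-recall ("needle in a haystack") probe.
# query_model() is a stand-in for whatever chat API or local model you use;
# it is NOT a real library call, so wire it up to your own setup.

FILLER = "The quick brown fox jumps over the lazy dog. " * 40  # bulk padding
NEEDLE = "The secret code is 7412."
QUESTION = "What is the secret code? Reply with the number only."


def build_prompt(depth: float, n_chunks: int = 50) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end)."""
    chunks = [FILLER] * n_chunks
    chunks.insert(int(depth * n_chunks), NEEDLE)
    return "\n".join(chunks) + "\n\n" + QUESTION


def query_model(prompt: str) -> str:
    """Placeholder: swap in your actual API call or local inference here."""
    raise NotImplementedError


if __name__ == "__main__":
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        answer = query_model(build_prompt(depth))
        print(f"depth={depth:.2f}  recalled={'7412' in answer}")
```

If the pattern described above holds, recall should be solid at depths 0.0 and 1.0 and weakest somewhere in the middle, though the exact shape depends heavily on the model and on how long the total prompt is.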