Hacker News

They're Markov chain generators with weighting that looks many tokens back and, based on a training corpus, assigns higher weight ("attention") to tokens that are more likely to influence the probability of later tokens: "evolutionary" might get more weight than "the", for instance (though to be clear, tokens aren't necessarily the same as words). The model then blends those various weights together before rolling its newly-weighted dice to come up with the next token.
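As a toy illustration of that weighting step (invented scores and token names, nothing like a real transformer's learned attention):

```python
import math

# Hypothetical raw attention scores for each context token: content
# words like "evolutionary" score higher than function words like "the".
context_scores = {"the": 0.1, "evolutionary": 2.0, "pressure": 1.5, "on": 0.2}

def attention_weights(scores):
    """Softmax: turn raw scores into weights that sum to 1,
    so high-scoring tokens dominate the blend."""
    exps = {tok: math.exp(s) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

weights = attention_weights(context_scores)
# "evolutionary" ends up with the largest share of the attention budget.
assert max(weights, key=weights.get) == "evolutionary"
```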

Throw in some noise reduction that discards too-low-probability tokens before sampling (top-k or top-p truncation), and that's basically it.
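A minimal sketch of that final sampling step, using a made-up next-token distribution and simple top-k truncation (real models use learned probabilities and often top-p instead):

```python
import random

# Toy next-token distribution a model might produce after blending
# attention-weighted context (the probabilities here are invented).
next_token_probs = {
    "species": 0.40,
    "biology": 0.25,
    "pressure": 0.20,
    "banana": 0.10,
    "the": 0.05,
}

def sample_next_token(probs, top_k=3):
    """Keep only the top_k most likely tokens (the "noise reduction"),
    renormalize, then roll the weighted dice."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in top)
    tokens = [tok for tok, _ in top]
    weights = [p / total for _, p in top]
    return random.choices(tokens, weights=weights, k=1)[0]

token = sample_next_token(next_token_probs)
# "banana" and "the" were truncated away, so they can never be sampled.
assert token in {"species", "biology", "pressure"}
```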

This dials down the usual chaos of Markov chains, and makes their output far more convincing.

Yes, that's really what all this fuss is about. Very fancy Markov chains.


