Yes: with BPE, for example, merges of already-seen (hence more common) subtokens are progressively pushed toward the 'top' of the vocabulary, so vocabulary index loosely tracks token frequency. Because of this, you can train a regression model to predict the vocabulary index of the next token from the current token's hidden state, using the same single regression model for all layer depths. If you plot the MSE of the index prediction against layer depth, the MSE decreases steadily with each additional layer. This appears to be because token index under BPE is actually fairly smooth, so the model seems to localize toward the correct vocabulary index as depth increases, a kind of fuzzy->discrete refinement as you go deeper in the layers. https://arxiv.org/abs/2408.13442
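
Roughly, the probe looks like this. A minimal sketch of my own (not the paper's code), assuming GPT-2 via HuggingFace transformers, a toy corpus, and a plain sklearn linear regression; the paper's actual model, data, and probe details may differ:

```python
# Sketch: fit ONE linear probe, shared across all depths, that predicts
# the *vocabulary index* of the next token from the hidden state at each
# layer, then report MSE per layer (it should fall with depth).
import numpy as np
import torch
from sklearn.linear_model import LinearRegression
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Toy corpus (placeholder); substitute a few paragraphs of real text.
text = "The model refines its guess of the next token as depth increases. " * 40
ids = tok(text, return_tensors="pt").input_ids[:, :512]  # (1, T)

with torch.no_grad():
    hidden = model(ids).hidden_states  # tuple of (n_layers+1) x (1, T, d)

# Features: hidden state at position t, pooled over every layer.
# Target: vocabulary index of the token at position t+1.
targets = ids[0, 1:].numpy().astype(float)  # next-token vocab indices
X = np.concatenate([h[0, :-1].numpy() for h in hidden])
y = np.concatenate([targets for _ in hidden])

probe = LinearRegression().fit(X, y)  # one probe for all depths

# Evaluate the same probe layer by layer.
for layer, h in enumerate(hidden):
    pred = probe.predict(h[0, :-1].numpy())
    print(f"layer {layer:2d}  MSE = {np.mean((pred - targets) ** 2):.3e}")
```

Training on all layers pooled (rather than one probe per layer) is what lets the per-layer MSE curve be read as the model progressively sharpening a single shared notion of "where in the vocab" the next token lives.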