The inner-layer permutability is super interesting. Is that result published anywhere? It's consistent with this graph here, which seems to imply different layers are working in closely related latent spaces.
If you skip to the graph here that shows the attention and feed-forward displacements tending to align (after a 2D projection): is this something known/understood? Are the attention and feed-forward displacement vectors highly correlated and mostly pointing in the same direction?
Skip to the graph above this paragraph: "Again, the red arrow represents the input vector, each green arrow represents one block’s self-attention output, each blue arrow represents one block’s feed-forward network output. Arranged tip to tail, their endpoint represents the final output from the stack of 6 blocks, depicted by the gray arrow."
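One way to check the alignment question directly, rather than eyeballing a 2D projection, is to compare each block's attention and feed-forward outputs in the full residual-stream space. A rough sketch of that measurement, using random stand-in vectors where real hooked activations would go (names like `attn_disp` and `ffn_disp` are hypothetical, not from the post):

```python
# Hypothetical sketch: in a residual-stream transformer, each block adds its
# self-attention output and its feed-forward output to the running vector, so
# those additions can be treated as displacement vectors (the green and blue
# arrows in the post's figure) and compared by cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_blocks = 64, 6

# Stand-ins for captured activations; with a real model you would hook each
# block and record its (attn_out, ffn_out) per token instead of random noise.
attn_disp = rng.standard_normal((n_blocks, d_model))
ffn_disp = rng.standard_normal((n_blocks, d_model))

def cosine(a, b):
    """Cosine similarity between two vectors: +1 aligned, 0 orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for i in range(n_blocks):
    print(f"block {i}: cos(attn, ffn) = {cosine(attn_disp[i], ffn_disp[i]):+.3f}")
```

If the displacements really do point the same way, the per-block cosines measured on real activations would sit well above the near-zero values random high-dimensional vectors give; a 2D projection alone can exaggerate alignment.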
https://shyam.blog/posts/beyond-self-attention/