
The inner-layer permutability is super interesting. Is that result published anywhere? It's consistent with the graph here, which seems to imply the different layers are working in closely related latent spaces.

If you skip to the graph that shows the attention + feed-forward displacements tending to align (after a 2D projection): is this something known/understood? Are the attention and feed-forward displacement vectors highly correlated and mostly pointing in the same direction? (A rough way to check this directly is sketched after the quoted caption below.)

https://shyam.blog/posts/beyond-self-attention/

Skip to the graph above this paragraph: "Again, the red arrow represents the input vector, each green arrow represents one block’s self-attention output, each blue arrow represents one block’s feed-forward network output. Arranged tip to tail, their endpoint represents the final output from the stack of 6 blocks, depicted by the gray arrow."
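One rough way to check that correlation directly, rather than eyeballing the 2D projection: hook each block's self-attention and feed-forward outputs and measure the cosine similarity between the two displacement vectors at a given token position. The sketch below uses GPT-2 from Hugging Face purely because it's easy to load; the post's own 6-block model would need its own module paths, so treat this as illustrative rather than the post's code.

    # Sketch (not from the post): capture each block's attention and MLP outputs
    # with forward hooks and compare the two displacement vectors per block.
    import torch
    import torch.nn.functional as F
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    tok = GPT2Tokenizer.from_pretrained("gpt2")

    captured = {}  # (block_idx, "attn" | "mlp") -> output tensor

    def make_hook(key):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            captured[key] = out.detach()
        return hook

    for i, block in enumerate(model.transformer.h):
        block.attn.register_forward_hook(make_hook((i, "attn")))
        block.mlp.register_forward_hook(make_hook((i, "mlp")))

    ids = tok("The quick brown fox", return_tensors="pt").input_ids
    with torch.no_grad():
        model(ids)

    pos = -1  # look at the last token position
    for i in range(len(model.transformer.h)):
        a = captured[(i, "attn")][0, pos]  # attention's addition to the residual stream
        m = captured[(i, "mlp")][0, pos]   # feed-forward's addition to the residual stream
        cos = F.cosine_similarity(a, m, dim=0).item()
        print(f"block {i}: cos(attn, mlp) = {cos:+.3f}, |attn| = {a.norm():.2f}, |mlp| = {m.norm():.2f}")

A cosine similarity well above zero for most blocks would mean the two displacements really do point in similar directions in the full embedding space, not just in the 2D projection.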



Those curves of "embedding displacement" are very interesting!

Quickly scanning the blog led me to this notebook, which shows how they're computed and includes other examples with similar behavior: https://github.com/spather/transformer-experiments/blob/mast...
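I haven't run the notebook, but if "embedding displacement" means something like the distance between consecutive blocks' hidden states (my reading, not necessarily the notebook's exact definition), you can approximate the same curves with the hidden states Hugging Face models already expose:

    # Back-of-the-envelope version (my assumption of the metric, not the notebook's code):
    # measure how far each block moves the per-token hidden states.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    tok = GPT2Tokenizer.from_pretrained("gpt2")

    ids = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)

    # hidden_states: the embedding output, then one tensor per block, each (1, seq, d_model)
    hs = out.hidden_states
    for i in range(1, len(hs)):
        disp = (hs[i] - hs[i - 1]).norm(dim=-1)  # per-token displacement introduced by block i
        print(f"block {i:2d}: mean displacement = {disp.mean():.2f}")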


I haven't published it nor have I seen it published.

I can copy-paste some of my raw notes/outputs from poking around with a small model (Phi-1.5) into a gist, though: https://gist.github.com/bluecoconut/6a080bd6dce57046a810787f...
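The experiment is roughly: swap two inner decoder layers and compare the next-token distribution before and after. The sketch below does that for Phi-1.5 via Hugging Face; the module path (model.model.layers) and the choice of which layers to swap are my assumptions for illustration, not taken from the gist.

    # Sketch of a layer-permutation check on Phi-1.5 (details are assumptions, not the gist's code).
    import copy
    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "microsoft/phi-1_5"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    ids = tok("The capital of France is", return_tensors="pt").input_ids

    def next_token_logits(m):
        with torch.no_grad():
            return m(ids).logits[0, -1]

    base = next_token_logits(model)

    # Swap two inner layers, leaving the first and last layers alone,
    # since reports of this kind of permutability tend to exclude the ends.
    permuted = copy.deepcopy(model)
    layers = permuted.model.layers
    layers[10], layers[11] = layers[11], layers[10]

    perm = next_token_logits(permuted)
    kl = F.kl_div(F.log_softmax(perm, -1), F.softmax(base, -1), reduction="sum")
    print(f"KL(base || permuted) = {kl:.4f}")
    print("base argmax:    ", tok.decode(base.argmax().item()))
    print("permuted argmax:", tok.decode(perm.argmax().item()))

A small KL divergence and an unchanged argmax after swapping inner layers is the sort of signal that suggests those layers are operating in closely related latent spaces.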



