I liked reading the different implementations of the low-level tensor ops (simple C/AVX/AVX2/WASM 128-bit/ARM NEON) -- it will help me learn how to use x86 SIMD. Thank you for writing this! Do you have any other recommendations/examples on how numerical code can be optimized via SIMD routines?
I don't have other recommendations, as I am a novice myself when it comes to SIMD. I think the multiplication routines in `whisper.cpp` are relatively basic - dot product and fused multiply-add. After some trial and error I came up with these implementations - not sure if they are optimal.
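
For anyone reading along, here is a rough sketch of the general pattern for a dot product built on fused multiply-add. This is not the actual `whisper.cpp` code: the names `dot_scalar` and `dot_simd` are made up for illustration, and the AVX2/FMA path is only compiled when the compiler defines `__AVX2__` and `__FMA__` (e.g. with `-mavx2 -mfma` on GCC/Clang). Otherwise it falls back to the plain loop.

```c
#include <stddef.h>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Scalar reference: returns the sum of x[i] * y[i] for i in [0, n). */
static float dot_scalar(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Hypothetical SIMD variant: processes 8 floats per iteration when
 * AVX2 + FMA are available, then handles the leftover elements. */
static float dot_simd(const float *x, const float *y, size_t n) {
    size_t i = 0;
    float sum = 0.0f;
#if defined(__AVX2__) && defined(__FMA__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);  /* unaligned loads */
        __m256 vy = _mm256_loadu_ps(y + i);
        acc = _mm256_fmadd_ps(vx, vy, acc);  /* acc = vx * vy + acc */
    }
    /* Horizontal sum: spill the 8 lanes and add them up. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; ++k) {
        sum += lanes[k];
    }
#endif
    /* Scalar tail (or the whole array when the SIMD path is disabled). */
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Keeping a scalar version next to the SIMD one makes it easy to check results. Note that the two can differ slightly in the last bits for general inputs, because the SIMD path adds the products in a different order and FMA rounds once instead of twice.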