> This is what a dot-product would look like in PeachPy
Is it just me (I'm far from an expert here) or is this code really weird? Why does ymm_one appear to contain the number zero? Why do we subtract what looks like it should be the inner product we want from ymm_one at the end?
The subtraction is explained by the text above it: the code "is an example of constructing the “Inner Product” distance".
That ymm_one might not contain one could be because they only need the result up to a constant and so don't care, but it's probably not ideal to name the register that holds zeros ymm_one.
I believe the argument order is "a = b - c", so it's effectively `ymm_c = 0 - ymm_c`. (And yes, ymm_one does seem to be initialized to zero, I'm not sure why they picked that name for it...)
That's a pretty weird "distance" unless you have some additional constraint like assuming the vectors are 2-norm normalised, in which case you get |a-b|^2 = 2(1-a.b).
It is natural to assume vectors have distance 0 from themselves at least, which this thing doesn't (in general). E.g. if you compute this "distance" for a = b = (2,0) then you get that the distance from a to itself is -3 which seems pretty weird.
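The identity from the parent comment is easy to sanity-check numerically. A quick sketch (the vectors here are my own examples, chosen to be unit length):

```python
import math

# Two unit vectors (2-norm normalised), as the parent comment assumes.
a = (3 / 5, 4 / 5)
b = (1.0, 0.0)

dot = sum(x * y for x, y in zip(a, b))            # a.b
dist2 = sum((x - y) ** 2 for x, y in zip(a, b))   # |a-b|^2

# For unit vectors, |a-b|^2 = 2(1 - a.b)
print(dist2, 2 * (1 - dot))
```

For non-normalised vectors like a = b = (2, 0), the identity fails and 1 - a.b goes negative, which is the -3 "self-distance" mentioned above.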
> *the state of the cells is being stored in a big array, accessed via the get_cell and set_cell helper functions. What if instead of using an array, we stored the whole state in one very long integer, and used SWAB arithmetic to process the whole thing at once?*
I’m really curious as to how the unpacking of this long integer into pixels on the screen doesn’t add more overhead than it saves. I guess I’ll have to wait for your next one on the compressed gzip stream hack.
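For what it's worth, the quoted idea is easy to demonstrate with a toy 1-D automaton rather than the article's 2-D one. A minimal sketch (my own example, not the article's code): Rule 90, where each new cell is the XOR of its two neighbours, so one step of the *entire* row is two shifts and an XOR on a single bigint.

```python
# Toy SWAB example: a whole row of 1-bit cells packed into one Python int.
def rule90_step(state: int, ncells: int) -> int:
    mask = (1 << ncells) - 1
    # each new cell is the XOR of its two neighbours; no per-cell loop
    return ((state << 1) ^ (state >> 1)) & mask

row = 1 << 8  # a single live cell in the middle of a 17-cell row
for _ in range(4):
    row = rule90_step(row, 17)
print(format(row, "017b"))  # the familiar Sierpinski-style spread
```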
If I had to guess, it’s a matter of setting up the Huffman tables such that every code ends up with a length of four bits, and then prepending that precomputed Huffman table onto the “compressed” stream. Then you can abuse the zlib library as a decompressor mapping the 4-bit “Huffman” codes to regular bytes.
Of course, this won’t be anywhere near as efficient as just implementing the decoder in C, as Huffman decoding needs to be much more general than just unpacking fixed-width chunks. But it will definitely be an improvement over naive loops.
I wonder whether .hex() could be pressed into service as a very scuffed 4-bit unpacker? Maybe something like .hex().encode().translate() to get an arbitrary palette mapping?
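To make the idea concrete, here's a sketch of how that chain could work: `to_bytes()` turns the bigint into bytes, `.hex()` splits every byte into two nibble characters (one per 4-bit cell), and `translate()` maps each nibble to an arbitrary palette byte. The palette itself is hypothetical.

```python
# Sketch: abusing bytes.hex() as a very scuffed 4-bit unpacker.
state = 0x1234                            # four 4-bit cells: 1, 2, 3, 4
nibbles = state.to_bytes(2, "big").hex()  # "1234" -- one char per cell

# hypothetical 16-entry palette: map each nibble char to a display glyph
palette = bytes.maketrans(b"0123456789abcdef", b" .:-=+*#%@&$?!^~")
pixels = nibbles.encode().translate(palette)
print(pixels)  # b'.:-='
```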
> Of course, this won’t be anywhere near as efficient as just implementing the decoder in C
> It is quite easy to add new built-in modules to Python, if you know how to program in C. Such extension modules can do two things that can’t be done directly in Python: they can implement new built-in object types, and they can call C library functions and system calls.
Then why are you talking about this instead of extension modules?
Neat trick! I implemented a similar bitpacking approach for solving matrix equations in GF(2), which can be used for things like forging CRC hashes (faster than bruteforce) and solving certain cryptography problems. Code is here: https://github.com/nneonneo/pwn-stuff/blob/master/math/gf2.p....
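A minimal sketch of that bitpacking idea (my own toy version, not the linked code): each row of a GF(2) matrix is one Python int, so adding two rows is a single XOR on the whole row, however wide it is.

```python
def gf2_rank(rows):
    """In-place Gaussian elimination over GF(2); returns the rank.
    Each element of `rows` is an int bitmask encoding one matrix row."""
    rank = 0
    nbits = max(rows).bit_length() if rows else 0
    for bit in reversed(range(nbits)):
        # find a pivot row with this bit set, below the reduced part
        pivot = next((i for i in range(rank, len(rows))
                      if rows[i] >> bit & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i] >> bit & 1:
                rows[i] ^= rows[rank]  # whole-row add in one bigint XOR
        rank += 1
    return rank

print(gf2_rank([0b110, 0b011, 0b101]))  # third row = XOR of the first two
```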
Ha! As it happens, I've recently written near identical code, and was considering writing about it in a future article - I originally had leaking internal states of xorshift128+ in mind :)
I just recently started going through a performance programming course (https://computerenhance.com/) and have learned about SIMD and other techniques, so it is awesome to see them used out in the wild.
Just stumbled upon this blog and it's absolutely intriguing! As a Python enthusiast, it's like finding a hidden treasure that challenges the usual norms of Python's capabilities. Thinking of writing a pure Python implementation of some ML algorithms while learning SIMD~
> The general term for this concept is SWAR, which stands for SIMD Within A Register. But here, rather than using a machine register, we're using an arbitrarily long Python integer. I'm calling this variant SWAB: SIMD Within A Bigint.
Thanks to Peano and Gödel, it's safe to say we can encode any computation as operations on natural numbers. So if anything is slow in Python for you, you can always encode it in a bigint and hope for the best.
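In the less Gödelian spirit of the quoted SWAR trick, here's a sketch of lane-wise arithmetic on a Python bigint (the names and lane layout are my own, not from the article): add corresponding `width`-bit lanes of two ints without letting carries bleed between lanes.

```python
def swab_add(a, b, lanes, width=8):
    """Per-lane addition mod 2**width on bigints packed with `lanes` lanes."""
    high = int(("1" + "0" * (width - 1)) * lanes, 2)  # top bit of each lane
    low = ((1 << (lanes * width)) - 1) ^ high         # all remaining bits
    # classic SWAR addition: sum the low bits, then patch the top bits in
    # with XOR, so any carry out of a lane is simply dropped
    return ((a & low) + (b & low)) ^ ((a ^ b) & high)

# two 8-bit lanes: 0xFF + 0x01 wraps to 0x00 without disturbing its neighbour
print(hex(swab_add(0x01FF, 0x0101, lanes=2)))
```

The same expression works unchanged however many lanes the bigint holds, which is the whole appeal of the SWAB variant.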
There is also a very different “write SIMD assembly in Python” approach available through the PeachPy library, one of the least known gems between Python and HPC worlds: https://github.com/Maratyszcza/PeachPy
This is what a dot-product would look like in PeachPy: https://unum-cloud.github.io/usearch/python/index.html#id4
PS: Cppyy and Numba are also fun to use in such projects :)