> This is what a dot-product would look like in PeachPy
Is it just me (I'm far from an expert here) or is this code really weird? Why does ymm_one appear to contain the number zero? Why do we subtract what looks like it should be the inner product we want from ymm_one at the end?
The subtraction is explained by the text above it: the code "is an example of constructing the “Inner Product” distance".
That ymm_one might not contain one could be because they only need the result up to a constant and so don't care, but it's probably not ideal to name the register that holds zeros ymm_one.
I believe the argument order is "a = b - c", so it's effectively `ymm_c = 0 - ymm_c`. (And yes, ymm_one does seem to be initialized to zero, I'm not sure why they picked that name for it...)
That's a pretty weird "distance" unless you have some additional constraint like assuming the vectors are 2-norm normalised, in which case you get |a-b|^2 = 2(1-a.b).
It is natural to assume vectors have distance 0 from themselves at least, which this thing doesn't (in general). E.g. if you compute this "distance" for a = b = (2,0) then you get that the distance from a to itself is -3 which seems pretty weird.
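The identity from the parent comment is easy to sanity-check numerically. A quick sketch (the vectors here are my own examples, chosen to be unit length):

```python
import math

# Two unit vectors (2-norm normalised), as the parent comment assumes.
a = (3 / 5, 4 / 5)
b = (1.0, 0.0)

dot = sum(x * y for x, y in zip(a, b))            # a.b
dist2 = sum((x - y) ** 2 for x, y in zip(a, b))   # |a-b|^2

# For unit vectors, |a-b|^2 = 2(1 - a.b)
print(dist2, 2 * (1 - dot))
```

For non-normalised vectors like a = b = (2, 0), the identity fails and 1 - a.b goes negative, which is the -3 "self-distance" mentioned above.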
> *the state of the cells is being stored in a big array, accessed via the get_cell and set_cell helper functions. What if instead of using an array, we stored the whole state in one very long integer, and used SWAB arithmetic to process the whole thing at once?*
I’m really curious as to how the unpacking of this long integer into pixels on the screen doesn’t add more overhead than it saves. I guess I’ll have to wait for your next one on the compressed gzip stream hack.
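For what it's worth, the quoted idea is easy to demonstrate with a toy 1-D automaton rather than the article's 2-D one. A minimal sketch (my own example, not the article's code): Rule 90, where each new cell is the XOR of its two neighbours, so one step of the *entire* row is two shifts and an XOR on a single bigint.

```python
# Toy SWAB example: a whole row of 1-bit cells packed into one Python int.
def rule90_step(state: int, ncells: int) -> int:
    mask = (1 << ncells) - 1
    # each new cell is the XOR of its two neighbours; no per-cell loop
    return ((state << 1) ^ (state >> 1)) & mask

row = 1 << 8  # a single live cell in the middle of a 17-cell row
for _ in range(4):
    row = rule90_step(row, 17)
print(format(row, "017b"))  # the familiar Sierpinski-style spread
```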
If I had to guess, it’s a matter of setting up the Huffman tables such that every code ends up with a length of four bits, and then prepending that precomputed Huffman table onto the “compressed” stream. Then you can abuse the zlib library as a decompressor mapping the 4-bit “Huffman” codes to regular bytes.
Of course, this won’t be anywhere near as efficient as just implementing the decoder in C, as Huffman decoding needs to be much more general than just unpacking fixed-width chunks. But it will definitely be an improvement over naive loops.
I wonder whether .hex() could be pressed into service as a very scuffed 4-bit unpacker? Maybe something like .hex().encode().translate() to get an arbitrary palette mapping?
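To make the idea concrete, here's a sketch of how that chain could work: `to_bytes()` turns the bigint into bytes, `.hex()` splits every byte into two nibble characters (one per 4-bit cell), and `translate()` maps each nibble to an arbitrary palette byte. The palette itself is hypothetical.

```python
# Sketch: abusing bytes.hex() as a very scuffed 4-bit unpacker.
state = 0x1234                            # four 4-bit cells: 1, 2, 3, 4
nibbles = state.to_bytes(2, "big").hex()  # "1234" -- one char per cell

# hypothetical 16-entry palette: map each nibble char to a display glyph
palette = bytes.maketrans(b"0123456789abcdef", b" .:-=+*#%@&$?!^~")
pixels = nibbles.encode().translate(palette)
print(pixels)  # b'.:-='
```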
> Of course, this won’t be anywhere near as efficient as just implementing the decoder in C
> It is quite easy to add new built-in modules to Python, if you know how to program in C. Such extension modules can do two things that can’t be done directly in Python: they can implement new built-in object types, and they can call C library functions and system calls.
Then why are you talking about this instead of extension modules?
Neat trick! I implemented a similar bitpacking approach for solving matrix equations in GF(2), which can be used for things like forging CRC hashes (faster than bruteforce) and solving certain cryptography problems. Code is here: https://github.com/nneonneo/pwn-stuff/blob/master/math/gf2.p....
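A minimal sketch of that bitpacking idea (my own toy version, not the linked code): each row of a GF(2) matrix is one Python int, so adding two rows is a single XOR on the whole row, however wide it is.

```python
def gf2_rank(rows):
    """In-place Gaussian elimination over GF(2); returns the rank.
    Each element of `rows` is an int bitmask encoding one matrix row."""
    rank = 0
    nbits = max(rows).bit_length() if rows else 0
    for bit in reversed(range(nbits)):
        # find a pivot row with this bit set, below the reduced part
        pivot = next((i for i in range(rank, len(rows))
                      if rows[i] >> bit & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i] >> bit & 1:
                rows[i] ^= rows[rank]  # whole-row add in one bigint XOR
        rank += 1
    return rank

print(gf2_rank([0b110, 0b011, 0b101]))  # third row = XOR of the first two
```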
Ha! As it happens, I've recently written near identical code, and was considering writing about it in a future article - I originally had leaking internal states of xorshift128+ in mind :)
I just recently started going through a performance programming course (https://computerenhance.com/) and have learned about SIMD and other techniques, so it is awesome to see them used out in the wild.
Just stumbled upon this blog and it's absolutely intriguing! As a Python enthusiast, it's like finding a hidden treasure that challenges the usual norms of Python's capabilities. Thinking of writing a pure Python implementation of some ML algorithms while learning SIMD~
> The general term for this concept is SWAR, which stands for SIMD Within A Register. But here, rather than using a machine register, we're using an arbitrarily long Python integer. I'm calling this variant SWAB: SIMD Within A Bigint.
Thanks to Peano and Gödel, it's safe to say we can encode any computation as operations on natural numbers. So if anything is slow in Python for you, you can always encode it in a bigint and hope for the best.
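In the less Gödelian spirit of the quoted SWAR trick, here's a sketch of lane-wise arithmetic on a Python bigint (the names and lane layout are my own, not from the article): add corresponding `width`-bit lanes of two ints without letting carries bleed between lanes.

```python
def swab_add(a, b, lanes, width=8):
    """Per-lane addition mod 2**width on bigints packed with `lanes` lanes."""
    high = int(("1" + "0" * (width - 1)) * lanes, 2)  # top bit of each lane
    low = ((1 << (lanes * width)) - 1) ^ high         # all remaining bits
    # classic SWAR addition: sum the low bits, then patch the top bits in
    # with XOR, so any carry out of a lane is simply dropped
    return ((a & low) + (b & low)) ^ ((a ^ b) & high)

# two 8-bit lanes: 0xFF + 0x01 wraps to 0x00 without disturbing its neighbour
print(hex(swab_add(0x01FF, 0x0101, lanes=2)))
```

The same expression works unchanged however many lanes the bigint holds, which is the whole appeal of the SWAB variant.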
There is also a very different “write SIMD assembly in Python” approach available through the PeachPy library, one of the least known gems between Python and HPC worlds: https://github.com/Maratyszcza/PeachPy
This is what a dot-product would look like in PeachPy: https://unum-cloud.github.io/usearch/python/index.html#id4
PS: Cppyy and Numba are also fun to use in such projects :)