
Before that, instruction sets were driven more by aesthetics and marketing than by performance. That sold chips in a world where people wrote assembly code -- instructions were added like language features. Thus instructions like REPNZ SCAS (i.e., strlen), which were sweet if you were writing string-handling code in assembler.
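For flavor, here's roughly what that strlen idiom looks like wrapped in C -- a minimal sketch, assuming GCC/Clang extended inline asm on x86-64 (scasb_strlen is a made-up name, not anything from a real libc):

  #include <stddef.h>

  /* strlen() via REPNE SCASB (REPNZ and REPNE are the same mnemonic):
     AL holds the byte to scan for (0), RDI walks the string, RCX counts
     down from -1, and the length falls out of how far RCX was decremented. */
  static size_t scasb_strlen(const char *s)
  {
      size_t count = (size_t)-1;          /* RCX: maximum scan length */
      __asm__ volatile("repne scasb"
                       : "+D"(s), "+c"(count)
                       : "a"(0)
                       : "memory", "cc");
      return (size_t)-2 - count;          /* bytes scanned before the NUL */
  }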

REP MOVS is still the fastest and most efficient way to do memcpy() on most Intel hardware, and likewise REP STOS for memset(), because they work on entire cachelines at once.
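The whole memcpy() then collapses to one instruction plus register setup -- a minimal sketch, again assuming GCC/Clang extended asm on x86-64 and the ABI guarantee that the direction flag is clear (rep_movsb_memcpy is a hypothetical name):

  #include <stddef.h>

  /* "rep movsb" copies RCX bytes from [RSI] to [RDI]; on CPUs with
     fast-strings support the microcode moves whole cachelines per step. */
  static void *rep_movsb_memcpy(void *dst, const void *src, size_t n)
  {
      void *ret = dst;
      __asm__ volatile("rep movsb"
                       : "+D"(dst), "+S"(src), "+c"(n)
                       :
                       : "memory");
      return ret;
  }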

It's worth noting that, if it weren't for a brief period during the late 70s/early 80s when memory was faster than the processor, RISC as we know it might never have been developed; the CISCs of the time spent the majority of their cycles decoding and executing instructions, leaving memory idle, and that idle bandwidth is what the RISC concept could exploit --- by trading off fetch bandwidth for faster instruction decoding and execution, they could gain more performance.

However, the situation was very different after that: continually increasing memory latencies, and then multiple cores all needing to be fed with an instruction stream, put RISC's increased fetch bandwidth at a disadvantage. Now, with a cache miss taking tens to hundreds of cycles or more, it seems a few extra clock cycles spent decoding more complex instructions to avoid one is the better option, and dedicated hardware for things like AES and SHA256 obviously can't be beat by a "pure RISC".

It was almost "game over" for "RISC = performance" in the early 90s, when Intel figured out how to decode CISC instructions quickly and in parallel with the P5/P6 and rapidly overtook the MIPS, SPARC, and Alpha chips that needed more cache, higher clock speeds, and more power to achieve comparable performance.

Certainly, it makes one wonder: had that brief moment in time not existed, and had memory always been significantly slower than the processor, would CPU designs have taken a completely different direction?



> REP MOVS is still the fastest and most efficient way to do memcpy() on most Intel hardware, and likewise REP STOS for memset()

Would that be the original REP MOVS/STOS, or the "fast strings" (P6)? Or the "this time we really mean it" fast strings (ERMSB, Ivy Bridge)? Or the "honestly, just trust us this time" fast strings (FSRM, Ice Lake)?
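You can at least ask the CPU which of those it got -- a sketch using GCC's <cpuid.h>, assuming I'm remembering the feature bits correctly (ERMSB in CPUID leaf 7 EBX bit 9, FSRM in leaf 7 EDX bit 4):

  #include <stdio.h>
  #include <cpuid.h>   /* GCC/Clang wrapper for the CPUID instruction */

  int main(void)
  {
      unsigned eax, ebx, ecx, edx;
      if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
          /* ERMSB: Enhanced REP MOVSB/STOSB, the Ivy Bridge one */
          printf("ERMSB: %s\n", (ebx & (1u << 9)) ? "yes" : "no");
          /* FSRM: Fast Short REP MOV, the Ice Lake one */
          printf("FSRM:  %s\n", (edx & (1u << 4)) ? "yes" : "no");
      }
      return 0;
  }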

> Now, with a cache miss taking tens to hundreds of cycles or more, it seems a few extra clock cycles spent decoding more complex instructions to avoid one is the better option

You can have both, actually. E.g. RISC-V with the compressed instruction extension achieves higher code density than x86-64.

> and dedicated hardware for things like AES and SHA256 obviously can't be beat by a "pure RISC"

Well, if you're really looking for minimal instruction sets, RISC is way bloated; IIRC single-instruction computers can be Turing complete. Obviously they are not very useful in practice. I think a better approximation of the RISC philosophy is "death to microcode", that is, the instruction set should match the hardware that the chip has. So if your chip has dedicated HW for some crypto or hashing algorithm, I wouldn't consider it "un-RISCy" to expose that in the ISA.
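For illustration, a toy SUBLEQ machine -- the canonical one-instruction computer -- fits in a dozen lines of C; the embedded program here is a made-up example that just subtracts one cell from another and halts:

  #include <stdio.h>

  /* SUBLEQ: each instruction is three words a, b, c meaning
       mem[b] -= mem[a]; if (mem[b] <= 0) jump to c, else fall through.
     A negative jump target halts. This one instruction is Turing
     complete given unbounded memory. */
  int main(void)
  {
      int mem[] = { 3, 4, -1,   /* subleq 3, 4, -1 */
                    7, 5 };     /* data: mem[3]=7, mem[4]=5 */
      int pc = 0;
      while (pc >= 0) {
          int a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
          mem[b] -= mem[a];
          pc = (mem[b] <= 0) ? c : pc + 3;
      }
      printf("mem[4] = %d\n", mem[4]);   /* prints -2, i.e. 5 - 7 */
      return 0;
  }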


Behold the OS X library version of memcpy, 177 instructions long: https://gist.github.com/tlbtlbtlb/6f6fdc5154210dc72950d8ef02.... It unrolls to 128-byte chunks. Presumably it's enough faster than REP MOVS to justify the memory footprint.

I believe most libc implementations are similar, such as glibc: https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=strin...
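For a feel of the unrolled style both of those use, here's a minimal sketch with SSE2 intrinsics, copying 128 bytes per iteration. It assumes n is a multiple of 128; real implementations add alignment handling, tail cases, and non-temporal stores for very large copies:

  #include <emmintrin.h>   /* SSE2 intrinsics */
  #include <stddef.h>

  /* Copy n bytes (n a multiple of 128) as eight 16-byte vector
     loads followed by eight 16-byte vector stores per iteration. */
  static void copy128(void *dst, const void *src, size_t n)
  {
      __m128i *d = (__m128i *)dst;
      const __m128i *s = (const __m128i *)src;
      for (size_t i = 0; i < n / 128; i++) {
          __m128i r0 = _mm_loadu_si128(s + 0);
          __m128i r1 = _mm_loadu_si128(s + 1);
          __m128i r2 = _mm_loadu_si128(s + 2);
          __m128i r3 = _mm_loadu_si128(s + 3);
          __m128i r4 = _mm_loadu_si128(s + 4);
          __m128i r5 = _mm_loadu_si128(s + 5);
          __m128i r6 = _mm_loadu_si128(s + 6);
          __m128i r7 = _mm_loadu_si128(s + 7);
          _mm_storeu_si128(d + 0, r0);
          _mm_storeu_si128(d + 1, r1);
          _mm_storeu_si128(d + 2, r2);
          _mm_storeu_si128(d + 3, r3);
          _mm_storeu_si128(d + 4, r4);
          _mm_storeu_si128(d + 5, r5);
          _mm_storeu_si128(d + 6, r6);
          _mm_storeu_si128(d + 7, r7);
          s += 8;
          d += 8;
      }
  }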



