
Before that, instruction sets were driven more by aesthetics and marketing than by performance. That sold chips in a world where people wrote assembly code -- instructions were added like language features. Thus instructions like REPNZ SCAS (i.e., strlen), which were sweet if you were writing string-handling code in assembler.
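For flavor, here's roughly what that strlen idiom looks like wrapped in C -- a minimal sketch, assuming GCC/Clang extended inline asm on x86-64 (scasb_strlen is a made-up name, not anything from a real libc):

  #include <stddef.h>

  /* strlen() via REPNE SCASB (REPNZ and REPNE are the same mnemonic):
     AL holds the byte to scan for (0), RDI walks the string, RCX counts
     down from -1, and the length falls out of how far RCX was decremented. */
  static size_t scasb_strlen(const char *s)
  {
      size_t count = (size_t)-1;          /* RCX: maximum scan length */
      __asm__ volatile("repne scasb"
                       : "+D"(s), "+c"(count)
                       : "a"(0)
                       : "memory", "cc");
      return (size_t)-2 - count;          /* bytes scanned before the NUL */
  }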

REP MOVS is still the fastest and most efficient way to do memcpy() on most Intel hardware, and likewise REP STOS for memset(), because they work on entire cachelines at once.
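The whole memcpy() then collapses to one instruction plus register setup -- a minimal sketch, again assuming GCC/Clang extended asm on x86-64 and the ABI guarantee that the direction flag is clear (rep_movsb_memcpy is a hypothetical name):

  #include <stddef.h>

  /* "rep movsb" copies RCX bytes from [RSI] to [RDI]; on CPUs with
     fast-strings support the microcode moves whole cachelines per step. */
  static void *rep_movsb_memcpy(void *dst, const void *src, size_t n)
  {
      void *ret = dst;
      __asm__ volatile("rep movsb"
                       : "+D"(dst), "+S"(src), "+c"(n)
                       :
                       : "memory");
      return ret;
  }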

It's worth noting that, if it weren't for a brief period during the late 70s/early 80s when memory was faster than the processor, RISC as we know it might never have been developed; the CISCs of the time spent the majority of their cycles decoding and executing instructions, leaving memory idle, and that idle bandwidth is what the RISC concept could exploit --- by trading off fetch bandwidth for faster instruction decoding and execution, they could gain more performance.

However, the situation was very different after that: continually increasing memory latencies, and then multiple cores all needing to be fed with an instruction stream, put RISC's increased fetch bandwidth at a disadvantage. Now, with a cache miss taking tens to hundreds of cycles or more, it seems a few extra clock cycles spent decoding more complex instructions to avoid one is the better option, and dedicated hardware for things like AES and SHA256 obviously can't be beat by a "pure RISC".

It was almost "game over" for "RISC = performance" in the early 90s, when Intel figured out how to decode CISC instructions quickly and in parallel with the P5/P6 and rapidly overtook the MIPS, SPARC, and Alpha chips that needed more cache, higher clock speeds, and more power to achieve comparable performance.

Certainly, it makes one wonder: had that brief moment in time not existed, and had memory always been significantly slower than the processor, would CPU designs have taken a completely different direction?



> REP MOVS is still the fastest and most efficient way to do memcpy() on most Intel hardware, and likewise REP STOS for memset()

Would that be the original REP MOVS/STOS, or the "fast strings" (P6)? Or the "this time we really mean it" fast strings (ERMSB, Ivy Bridge)? Or the "honestly, just trust us this time" fast strings (FSRM, Ice Lake)?
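You can at least ask the CPU which of those it got -- a sketch using GCC's <cpuid.h>, assuming I'm remembering the feature bits correctly (ERMSB in CPUID leaf 7 EBX bit 9, FSRM in leaf 7 EDX bit 4):

  #include <stdio.h>
  #include <cpuid.h>   /* GCC/Clang wrapper for the CPUID instruction */

  int main(void)
  {
      unsigned eax, ebx, ecx, edx;
      if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
          /* ERMSB: Enhanced REP MOVSB/STOSB, the Ivy Bridge one */
          printf("ERMSB: %s\n", (ebx & (1u << 9)) ? "yes" : "no");
          /* FSRM: Fast Short REP MOV, the Ice Lake one */
          printf("FSRM:  %s\n", (edx & (1u << 4)) ? "yes" : "no");
      }
      return 0;
  }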

> Now, with a cache miss taking tens to hundreds of cycles or more, it seems a few extra clock cycles spent decoding more complex instructions to avoid one is the better option

You can have both, actually. E.g. RISC-V with the compressed instruction extension achieves higher code density than x86-64.

> and dedicated hardware for things like AES and SHA256 obviously can't be beat by a "pure RISC"

Well, if you're really looking for minimal instruction sets, RISC is way bloated; IIRC single-instruction computers can be Turing complete. Obviously they are not very useful in practice. I think a better approximation of the RISC philosophy is "death to microcode", that is, the instruction set should match the hardware that the chip has. So if your chip has dedicated HW for some crypto or hashing algorithm, I wouldn't consider it "un-RISCy" to expose that in the ISA.
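For illustration, a toy SUBLEQ machine -- the canonical one-instruction computer -- fits in a dozen lines of C; the embedded program here is a made-up example that just subtracts one cell from another and halts:

  #include <stdio.h>

  /* SUBLEQ: each instruction is three words a, b, c meaning
       mem[b] -= mem[a]; if (mem[b] <= 0) jump to c, else fall through.
     A negative jump target halts. This one instruction is Turing
     complete given unbounded memory. */
  int main(void)
  {
      int mem[] = { 3, 4, -1,   /* subleq 3, 4, -1 */
                    7, 5 };     /* data: mem[3]=7, mem[4]=5 */
      int pc = 0;
      while (pc >= 0) {
          int a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
          mem[b] -= mem[a];
          pc = (mem[b] <= 0) ? c : pc + 3;
      }
      printf("mem[4] = %d\n", mem[4]);   /* prints -2, i.e. 5 - 7 */
      return 0;
  }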


Behold the OS X library version of memcpy, 177 instructions long: https://gist.github.com/tlbtlbtlb/6f6fdc5154210dc72950d8ef02.... It unrolls to 128-byte chunks. Presumably it's enough faster than REP MOVS to justify the memory footprint.

I believe most libc implementations are similar, such as glibc: https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=strin...
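For a feel of the unrolled style both of those use, here's a minimal sketch with SSE2 intrinsics, copying 128 bytes per iteration. It assumes n is a multiple of 128; real implementations add alignment handling, tail cases, and non-temporal stores for very large copies:

  #include <emmintrin.h>   /* SSE2 intrinsics */
  #include <stddef.h>

  /* Copy n bytes (n a multiple of 128) as eight 16-byte vector
     loads followed by eight 16-byte vector stores per iteration. */
  static void copy128(void *dst, const void *src, size_t n)
  {
      __m128i *d = (__m128i *)dst;
      const __m128i *s = (const __m128i *)src;
      for (size_t i = 0; i < n / 128; i++) {
          __m128i r0 = _mm_loadu_si128(s + 0);
          __m128i r1 = _mm_loadu_si128(s + 1);
          __m128i r2 = _mm_loadu_si128(s + 2);
          __m128i r3 = _mm_loadu_si128(s + 3);
          __m128i r4 = _mm_loadu_si128(s + 4);
          __m128i r5 = _mm_loadu_si128(s + 5);
          __m128i r6 = _mm_loadu_si128(s + 6);
          __m128i r7 = _mm_loadu_si128(s + 7);
          _mm_storeu_si128(d + 0, r0);
          _mm_storeu_si128(d + 1, r1);
          _mm_storeu_si128(d + 2, r2);
          _mm_storeu_si128(d + 3, r3);
          _mm_storeu_si128(d + 4, r4);
          _mm_storeu_si128(d + 5, r5);
          _mm_storeu_si128(d + 6, r6);
          _mm_storeu_si128(d + 7, r7);
          s += 8;
          d += 8;
      }
  }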



