The vector instructions can't really be farmed out, because they can be scattered inline with regular scalar code. A memcpy of a small- to medium-sized struct might be compiled into a handful of 128-bit movs, for example, with the code then immediately working on the copied struct. If you were to offload that to a different processor, waiting on that work to finish would stall the entire pipeline.
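To make that concrete, here's a rough C sketch of the pattern being described: a small fixed-size copy that a compiler will typically inline as 128-bit moves, immediately followed by scalar code that consumes the result. The struct and function names are just for illustration, and the exact codegen depends on the compiler, optimization level, and target flags.

    #include <string.h>

    /* A 32-byte struct: small enough that gcc/clang at -O2 usually inline
       the copy as a couple of 16-byte (XMM) loads/stores instead of
       calling memcpy. Illustrative only; codegen varies by compiler and
       -march settings. */
    struct point4 {
        double x, y, z, w;
    };

    double move_and_use(const struct point4 *src)
    {
        struct point4 local;

        /* Typically lowered to two pairs of movups/movaps: vector
           instructions emitted inline, right next to the scalar code
           below. */
        memcpy(&local, src, sizeof local);

        /* Scalar work that consumes the copy immediately. If the copy
           ran on a separate coprocessor, this line would have to wait
           for it to finish. */
        return local.x + local.y + local.z + local.w;
    }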
Could the compiler create a binary that had those instructions running on multiple processors? I see now I have some googling/reading to do about how you even use multiple processors (not cores) in a program.
> A memcpy of a small- to medium-sized struct might be compiled into a handful of 128-bit movs, for example, with the code then immediately working on the copied struct
I'm not sure that's true: rep movs is pretty fast these days.
> If you believe this, you won't believe what's in this box[1].
There's a fundamental difference between GPU code and vector CPU instructions, though. GPU shader instructions aren't interwoven with the CPU instructions.
Yes: if you restrict yourself to not arbitrarily mixing the vector code with the non-vector code, you can move the vector code off to a dedicated processor (the GPU, in this case). But the GP's point was exactly that the absence of such a restriction is what prevents efficiently farming the work out to a coprocessor.
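For what it's worth, this is roughly what that mixing looks like with SSE intrinsics (a sketch, assuming x86_64 with SSE2; the function and names are made up for illustration). The 128-bit operations are ordinary instructions sitting in the middle of the scalar control flow, not a separable kernel you could hand to another device.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Vector and scalar code arbitrarily interleaved: the kind of code
       the GP says can't be farmed out to a coprocessor cheaply. */
    int64_t sum_and_flag(const int32_t *a, const int32_t *b, int32_t threshold)
    {
        /* Vector part: add four 32-bit lanes in one 128-bit instruction. */
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vs = _mm_add_epi32(va, vb);

        int32_t lanes[4];
        _mm_storeu_si128((__m128i *)lanes, vs);

        /* Scalar part, consuming the vector result immediately: a branch
           that depends on the lanes. Shipping the three vector
           instructions above to another processor would mean stalling
           right here until the answer comes back. */
        int64_t total = (int64_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];
        return total > threshold ? total : 0;
    }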
> I'm not sure that's true: rep movs is pretty fast these days.
That's only true if you target Skylake and newer. If you target generic x86_64, compilers will only emit rep movs for long copies, because some CPUs have a high baseline cost for it. There's some linker magic (glibc's IFUNC resolution) that might get you an optimized version when the code actually does a callq to memcpy, but that doesn't help with inlined copies.
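A quick sketch of that distinction, assuming gcc/clang at -O2 targeting generic x86_64 (the struct sizes are arbitrary, and the exact thresholds vary by compiler version and -march):

    #include <string.h>

    struct small { char bytes[32]; };
    struct big   { char bytes[4096]; };

    void copy_small(struct small *dst, const struct small *src)
    {
        /* Known small size: typically inlined as a few 16-byte moves.
           No call is emitted, so the runtime-selected (IFUNC) memcpy in
           glibc never gets a chance to help. */
        memcpy(dst, src, sizeof *dst);
    }

    void copy_big(struct big *dst, const struct big *src)
    {
        /* Large size: typically compiled to an actual call to memcpy,
           which glibc resolves at load time to a variant tuned for the
           CPU it's running on (possibly rep movsb where ERMSB/FSRM is
           available). */
        memcpy(dst, src, sizeof *dst);
    }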
That is irrelevant. The default target of compilers is some conservative minimum profile. Any binary you download is compiled for wide compatibility, not to run on your computer only.
That’s different. Rendering happens entirely on the GPU, so the only data transfer is a one-way DMA stream containing scene primitives and instructions.
There's absolutely no reason it _has_ to be one-way: it's not like the CPU intrinsically speaks x86_64 or is directly attached to memory anyway. When inventing a new ISA, we can do anything.
And if we're talking about memcpy over (small) ranges that are likely still in L1, you're definitely not going to notice the difference.