The vector instructions can't really be farmed out, because they can be scattered inline with regular scalar code. A memcpy of a small- to medium-sized struct might be compiled into a handful of 128-bit movs, for example, with the code then immediately working on the copied struct. If you were to offload that to a different processor, waiting on that work to finish would stall the entire pipeline.
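To make that concrete, here's a rough C sketch of the pattern being described: a small fixed-size copy that a compiler will typically inline as 128-bit moves, immediately followed by scalar code that consumes the result. The struct and function names are just for illustration, and the exact codegen depends on the compiler, optimization level, and target flags.

    #include <string.h>

    /* A 32-byte struct: small enough that gcc/clang at -O2 usually inline
       the copy as a couple of 16-byte (XMM) loads/stores instead of
       calling memcpy. Illustrative only; codegen varies by compiler and
       -march settings. */
    struct point4 {
        double x, y, z, w;
    };

    double move_and_use(const struct point4 *src)
    {
        struct point4 local;

        /* Typically lowered to two pairs of movups/movaps: vector
           instructions emitted inline, right next to the scalar code
           below. */
        memcpy(&local, src, sizeof local);

        /* Scalar work that consumes the copy immediately. If the copy
           ran on a separate coprocessor, this line would have to wait
           for it to finish. */
        return local.x + local.y + local.z + local.w;
    }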
Could the compiler create a binary that had those instructions running on multiple processors? I see now I have some googling/reading to do about how you even use multiple processors (not cores) in a program.
> A memcpy of a small- to medium-sized struct might be compiled into a handful of 128-bit movs, for example, with the code then immediately working on the copied struct
I'm not sure that's true: rep movs is pretty fast these days.
> If you believe this, you won't believe what's in this box[1].
There's a fundamental difference between GPU code and vector CPU instructions, though. GPU shader instructions aren't interwoven with the CPU instructions.
Yes: if you restrict yourself to not arbitrarily mixing the vector code with the non-vector code, you can move the vector code off to a dedicated processor (the GPU, in this case). But the GP's point was exactly that the absence of such a restriction is what prevents efficiently farming the work out to a coprocessor.
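For what it's worth, this is roughly what that mixing looks like with SSE intrinsics (a sketch, assuming x86_64 with SSE2; the function and names are made up for illustration). The 128-bit operations are ordinary instructions sitting in the middle of the scalar control flow, not a separable kernel you could hand to another device.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Vector and scalar code arbitrarily interleaved: the kind of code
       the GP says can't be farmed out to a coprocessor cheaply. */
    int64_t sum_and_flag(const int32_t *a, const int32_t *b, int32_t threshold)
    {
        /* Vector part: add four 32-bit lanes in one 128-bit instruction. */
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vs = _mm_add_epi32(va, vb);

        int32_t lanes[4];
        _mm_storeu_si128((__m128i *)lanes, vs);

        /* Scalar part, consuming the vector result immediately: a branch
           that depends on the lanes. Shipping the three vector
           instructions above to another processor would mean stalling
           right here until the answer comes back. */
        int64_t total = (int64_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];
        return total > threshold ? total : 0;
    }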
> I'm not sure that's true: rep movs is pretty fast these days.
That's only true if you target Skylake and newer. If you target generic x86_64, compilers will only emit rep movs for long copies, because some CPUs have a high baseline cost for it. There's some linker magic (glibc's IFUNC resolution) that might get you an optimized version when the code actually does a callq to memcpy, but that doesn't help with inlined copies.
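A quick sketch of that distinction, assuming gcc/clang at -O2 targeting generic x86_64 (the struct sizes are arbitrary, and the exact thresholds vary by compiler version and -march):

    #include <string.h>

    struct small { char bytes[32]; };
    struct big   { char bytes[4096]; };

    void copy_small(struct small *dst, const struct small *src)
    {
        /* Known small size: typically inlined as a few 16-byte moves.
           No call is emitted, so the runtime-selected (IFUNC) memcpy in
           glibc never gets a chance to help. */
        memcpy(dst, src, sizeof *dst);
    }

    void copy_big(struct big *dst, const struct big *src)
    {
        /* Large size: typically compiled to an actual call to memcpy,
           which glibc resolves at load time to a variant tuned for the
           CPU it's running on (possibly rep movsb where ERMSB/FSRM is
           available). */
        memcpy(dst, src, sizeof *dst);
    }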
That is irrelevant. The default target of compilers is some conservative minimum profile. Any binary you download is compiled for wide compatibility, not to run on your computer only.
That’s different. Rendering happens entirely on the GPU, so the only data transfer is a one-way DMA stream containing scene primitives and instructions.
There's absolutely no reason it _has_ to be one-way: it's not like the CPU intrinsically speaks x86_64 or is directly attached to memory anyway. When inventing a new ISA, we can do anything.
And if we're talking about memcpy over (small) ranges that are likely still in L1, you're definitely not going to notice the difference.