
The word "moonshot" is close to the truth, yet there are strategies to overcome the adoption threshold. That was precisely the topic of my PhD dissertation, and I concur that the Adapteva guys have chosen a sound approach.

But Epiphany is not a dataflow architecture, and suffers from execution efficiency problems in individual cores. Dataflow is really the way to go to lower energy consumption dramatically.

Funny this topic comes up. It just so happens that the EU has recently invested in such a project to co-design a dataflow processor and software stack. It was north of a million euros for the initial proof-of-concept research, and the results are slowly starting to trickle through.

Some references:

- http://staff.science.uva.nl/~poss/pub/poss.12.dsd.pdf
- http://staff.science.uva.nl/~poss/pub/poss.13.micpro.pdf

(I am one of the authors; ping me for more information)



Dataflow is really the way to go to lower energy consumption dramatically.

Could you expand on this for a layperson? I'm terribly interested.


Minimum energy usage is very dependent on not activating more circuits than strictly required for a given computation.

However, a conventional processor pipeline will usually fetch instructions and begin processing them, only to realize later that they were not necessary. This happens upon mispredicted branches, cache misses, exceptions, etc. These correspond to circuits that get activated and spend energy, only to throw away their results because the instruction's effects must be discarded.

In contrast, in a dataflow processor, each instruction indicates explicitly which other instruction(s) will produce its input. Or conversely, which other instruction(s) get activated as the result of one instruction completing execution. This way, instructions only enter the pipeline when their operands are ready, and speculation never occurs. So there is no more energy spent than strictly necessary to do the work (instructions).
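
To make the firing rule concrete, here is a toy software sketch in plain C (my own illustration, not how Epiphany or any real dataflow ISA encodes this): an instruction carries a count of missing operands and a list of consumers, and it only "fires" once the count reaches zero.

    #include <stdio.h>

    typedef struct Insn Insn;
    struct Insn {
        const char *name;
        int missing;          /* operands not yet delivered          */
        int value;            /* toy result: the sum of the operands */
        Insn *consumers[2];   /* who is woken up by our result       */
        int nconsumers;
    };

    /* Deliver one operand; the instruction fires only when the last
     * operand arrives, so nothing is ever issued speculatively. */
    static void deliver(Insn *to, int operand) {
        to->value += operand;
        if (--to->missing == 0) {
            printf("firing %s -> %d\n", to->name, to->value);
            for (int i = 0; i < to->nconsumers; i++)
                deliver(to->consumers[i], to->value);
        }
    }

    int main(void) {
        Insn d = { "d", 1, 0, { 0 }, 0 };   /* d = c (copy)  */
        Insn c = { "c", 2, 0, { &d }, 1 };  /* c = a + b     */
        deliver(&c, 2);   /* the producer of 'a' completes          */
        deliver(&c, 3);   /* 'b' arrives: c fires, then wakes up d  */
        return 0;
    }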

Now, the reason why we use the former forms of speculation is that they are the only way to make the pipeline fast when there is no information in the instruction stream (program) about the dataflow dependencies between instructions. Because it does not know better, the scheduler has to either: 1) try all instructions in program order, start doing work as early as possible, and sometimes discard work already started because an earlier instruction has resolved a branch, raised a fault, etc.; or 2) rediscover the dataflow links by analyzing the instructions as they enter the processor, but then the silicon logic that implements these tricks also costs energy.

The funny thing is, all compilers know about dataflow dependencies between instructions, but they throw the information away because the existing instruction sets cannot encode it.
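
As a trivial illustration (mine, not from the papers above), here is the dependency information a compiler's intermediate representation already carries for a few lines of C, and then drops when lowering to a conventional ISA:

    int f(int a, int b) {
        int t1 = a + b;    /* t1 <- a, b                        */
        int t2 = a * 2;    /* t2 <- a      (independent of t1)  */
        return t1 - t2;    /* result <- t1, t2                  */
    }
    /* In the emitted machine code only register names and program
     * order survive; the fact that t2 does not depend on t1 has to
     * be re-discovered by the hardware at run time. */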

So really the situation should be simple: make new processors that support dataflow annotations, extend the compilers to encode this information (which they already have anyway), and off we go.

However, as others have highlighted, making new instruction sets is like a "moonshot" because you have to involve a lot of people: compiler implementers, but also OS devs and everyone who will need to port their code to the new ISA.

Besides, dataflow processors have a gorilla in the kitchen too. In a "pure" dataflow scheduler, all instruction order is destroyed and, as a result, cache locality is broken. So the flip side of the coin becomes 1) bad memory performance and 2) extra energy expenditure on the memory system to deal with cache misses.

Now there are ways to get the best of both worlds.

One is to destroy the ordering of instructions only partially, by applying dataflow scheduling only within a window (e.g. the next 20 instructions). This is more or less what modern out-of-order processors do, although they still waste energy re-discovering dataflow links at run time.
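
A rough sketch of the window idea (my own simplification, not a description of any shipping core): only instructions inside a small window compete for issue, and only once their operands are ready.

    #include <stdbool.h>
    #include <stdio.h>

    #define WINDOW 20

    typedef struct {
        bool ready;    /* all producers have completed */
        bool issued;
    } Slot;

    /* Pick the oldest ready, not-yet-issued instruction in the window;
     * outside the window, program order is left untouched. */
    static int pick_next(const Slot win[WINDOW]) {
        for (int i = 0; i < WINDOW; i++)
            if (win[i].ready && !win[i].issued)
                return i;
        return -1;    /* nothing ready: stall instead of speculating */
    }

    int main(void) {
        Slot win[WINDOW] = { { false, false } };
        win[3].ready = true;                        /* insn 3's inputs arrived */
        printf("issue slot %d\n", pick_next(win));  /* prints 3                */
        return 0;
    }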

The other technique is where many of us are going right now: use multiple hardware threads, interleaved; keep the instruction order within threads to exploit whichever locality is encoded by the programmer, and apply dataflow scheduling techniques across instructions from separate threads, i.e. exploit maximum instruction concurrency between independent threads. Sun/Oracle started it with Niagara, and now ARM is going there too. This approach works very well in terms of operations/watt; however, it requires software to use threads in the first place, and not much software does that (yet).
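
In scheduler terms the idea looks roughly like this (an illustration only, not Niagara's or anyone's actual selection logic): order inside each thread is preserved, and dataflow readiness decides only between threads.

    #include <stdbool.h>
    #include <stdio.h>

    #define NTHREADS 4

    typedef struct {
        int pc;        /* next instruction, in program order            */
        bool ready;    /* operands of the instruction at pc available?  */
    } Thread;

    /* Each cycle, rotate over the hardware threads and issue from the
     * first one whose next instruction has its data. */
    static int select_thread(const Thread t[NTHREADS], int last) {
        for (int i = 1; i <= NTHREADS; i++) {
            int cand = (last + i) % NTHREADS;
            if (t[cand].ready)
                return cand;
        }
        return -1;    /* all threads are waiting on memory/data */
    }

    int main(void) {
        Thread t[NTHREADS] = { { 0, false }, { 0, true }, { 0, true }, { 0, false } };
        printf("issue from thread %d\n", select_thread(t, 0));  /* prints 1 */
        return 0;
    }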

Also there is still a lot of ongoing research.


Adding to that:

For many applications, pure power consumption isn't even the best metric anymore. Due to advances in on-chip power and clock distribution, the energy × delay product and overall silicon efficiency have gained more importance in recent years.
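
A made-up example of why the two metrics can disagree (numbers invented purely for illustration):

    Design A: 0.5 W for 2 ms -> energy = 1.0 mJ, energy x delay = 2.0 µJ·s
    Design B: 1.5 W for 1 ms -> energy = 1.5 mJ, energy x delay = 1.5 µJ·s

A wins on raw energy, B wins on energy × delay; which you want depends on the application.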

Obviously, dataflow processors excel in these metrics. And VLIW processors fall behind, which IMHO is the primary reason for their demise.

I agree with you that a practical dataflow architecture needs to be hierarchical. Not just for cache locality, but to reduce wiring overhead and debugging complexity, too.


If compilers already have the dependency information and could provide it in instruction annotations, then why hasn't Intel done anything with this? Intel has its own C/C++ compiler, so it could extend the x86 instruction set with new instructions that contain the necessary annotations, and add support for these annotations in its compiler.


Except that would not show lower energy usage. To preserve backward compatibility, the cores would need to keep the logic that analyses data dependencies, to continue delivering good performance for legacy code. To make any difference they would need to both do what you say, and also define some protocol that instructs the processor to disable the dataflow analysis unit entirely (to save energy). But that protocol would be invasive, because you need to re-activate the unit at the first instruction that is not annotated, and upon faults, branches, etc. The logic to coordinate this protocol becomes a new energy expenditure of its own!

Really the way forward would be to extend x86 completely, with a "mode" where all instructions are annotated and go through a different pipeline front-end than legacy x86 code. But Intel already tried that with IA64, and it burned them very hard. I am not sure they are willing to do it again.


IA64 was an entirely new instruction set on an entirely new architecture, having nothing whatsoever to do with x86. Compatibility with x86 was added later to try to improve sales.

AMD64 / x64 pretty much hops into a different mode and goes on executing from there. Given how many modes and instructions these chips support, I don't see why adding another would easily upset people.


Yes, you are right: if it were marketed like the introduction of x64, it might work.

However, there is a big difference: the move to 64-bit words was something that was in demand when it was introduced. There was a market for it, with a very clear value proposition.

In contrast, a new "mode" with the same computational power plus dataflow annotations would be a tough sell: larger code size, and better performance/watt only for some applications.

(Also, as far as I know AMD64 / x64 on Intel cores uses the same decode and issue unit, just with different microcode. Circuit-wise there is a lot in common with x86.

Here we would be talking about a new mode and also a new instruction+data path in the processor. The core would become larger and more expensive. Not sure how that plays.)


Is there something about dataflow annotated instructions that would require lots of internal changes? I mean past the decoding step. Because Intel and AMD add new instructions and modes all the time, frequently stuff that is basically totally orthogonal to whatever's gone before.


In your research, are you using a new programming language to take advantage of dataflow scheduling techniques, or are you working with one or more existing languages? If the latter, do you have any data or opinions on which languages or language features are most amenable to an effective dataflow-based architecture?


We use just C extensions for now, very close to what Cilk does.
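
For flavor, a Cilk-style fragment (the keyword spelling here follows MIT Cilk-5; our actual extensions differ in the details): the two spawned calls may run concurrently, x and y behave like write-once dataflow variables, and the code after sync only runs once both have been produced.

    cilk int fib(int n) {
        int x, y;
        if (n < 2) return n;
        x = spawn fib(n - 1);   /* may execute concurrently          */
        y = spawn fib(n - 2);
        sync;                   /* wait for both dataflow results    */
        return x + y;
    }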

1) What is really important is to realize that dataflow variables (I-structures) are not in memory. So any language/library that gives dataflow semantics to programmers should not allow programmers to indirect through memory to get to the I-structures. This is the main requirement for an efficient projection to a hardware dataflow scheduler.

In practice, things like Occam, SISAL and most pure functional programming languages are OK-ish.
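
A small C-level illustration of the point (mine, not from the papers): as soon as a program can reach a value through a pointer, the value must live in memory, and the hardware can no longer map it onto an I-var whose consumer is statically known.

    extern int produce(void);
    extern int consume(int);

    /* Fine: the value flows through a name that can be mapped onto an
     * I-var or register; producer and consumer are statically known. */
    int direct(void) {
        int x = produce();
        return consume(x);
    }

    /* Problematic: taking the address forces x into memory, and the
     * store through p hides which instruction the arriving value
     * should wake up. */
    int indirect(void) {
        int x = 0;
        int *p = &x;
        *p = produce();
        return consume(x);
    }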

2) Any language should allow an (advanced) programmer (or compiler) to annotate the instructions to also suggest some ordering not related to data dependencies. As I explained before, dataflow scheduling tends to destroy order and break locality, and for some applications this is very bad. Unfortunately, all existing dataflow-ish languages (incl. most functional languages) were designed with the outdated vision that all memory accesses have the same cost. We now know this is no longer true.

Other than using threads (as I explained before), a well-known theoretical way forward is to introduce optional control flow edges between instructions using "ghost" data dependencies, which impact scheduling but do not allocate registers/I-vars. However, I am not aware of any languages where this is possible.
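
Sketch of what such an annotation could look like (the pragma is invented purely for illustration; as said, no existing language offers this):

    int sum_two(const int a[100][100]) {
        int s0 = a[0][0];            /* load A                            */
        int s1 = a[0][1];            /* load B: no data dependency on A   */
        /* #pragma ghost_after(A, B)    ordering-only edge: schedule B
         * right after A so both loads hit the same cache line, without
         * allocating an I-var or register for a carried value.          */
        return s0 + s1;
    }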


Could you expand on the following?

What is really important is to realize that dataflow variables (I-structures) are not in memory. So any language/library that gives dataflow semantics to programmers should not allow programmers to indirect through memory to get to the I-structures. This is the main requirement for an efficient projection to a hardware dataflow scheduler

I don't know what you mean by I-structures, or dataflow variables, or why they are not in memory, or why allowing programmers to get at them would mess with a hardware dataflow schedule. I can guess at what you mean by all of these, but a more detailed explanation would still be very interesting.


See: Arvind, R. S. Nikhil, and K. K. Pingali, "I-structures: data structures for parallel computing," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 11, no. 4, Oct. 1989.


Thank you.


Thank you. Great answer, great thread.


ping

I am very interested in this! Drop me a line at my nick @ vub.ac.be



