
It's proprietary but somewhat amenable to exploring through experiment.

The heuristics go something like:

1. Find out what execution ports your processor has. E.g. it can probably do two 256-bit loads from L1 cache each cycle and probably can't do two stores. It can do arithmetic at the same time. Beware of your arithmetic and your address calculations colliding on the same ports.

2. Look for some indication of what the register files are - you don't want to read from a register immediately after writing it and probably don't want to wait too long either, and there's a load of latency-hiding register renaming going on in the background. This one seems especially poorly documented.

3. The aim is to order instructions so that the dynamic scheduler has an easier time keeping the ports occupied and so that stalls on register access are unlikely.

4. Choosing different instructions may make that work better, in a gnarly NP over NP sort of fashion.

5. Moving redundant or reversible calculations across branches can be a good idea.
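
To make point 3 a bit more concrete, here's a rough sketch in C of the usual trick of splitting one dependency chain into several so the scheduler has independent work to hand out. The function names `sum_chained` and `sum_split` and the choice of four accumulators are mine, purely for illustration. Whether it helps at all depends on the core and on what the compiler already does, so measure on your own hardware.

```c
#include <stddef.h>

/* One accumulator: every add depends on the result of the previous
   add, so the loop is limited by add latency, not by port count. */
float sum_chained(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four accumulators (the count is an assumption, tune per target):
   the four adds in an iteration are independent of each other, so an
   out-of-order core is free to overlap them. */
float sum_split(const float *x, size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

Note that the two versions add the floats in a different order, so the results can differ in the last bits. That's also why a compiler typically won't do this rewrite for you unless you let it reassociate floating-point maths.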

The DSP chips are much more fun to schedule in the compiler as branches are usually cheaper and there's probably no reordering happening at runtime.