This flowchart hides the most awful parts (IMO) of x86 prefixes: some combinations of prefixes are invalid but still parsed and executed, like combining two segment overrides, or placing a legacy prefix after a REX prefix.
The CPU also doesn't care if you use prefixes that aren't valid for a specific instruction, for example a REP on a non-repeatable instruction. The LOCK prefix is the only prefix that makes the sane choice to reject invalid combinations, rather than silently accept them.
Also, the (E)VEX prefix doesn't behave like the other prefixes: it must be placed last, and can therefore only appear once. All other prefixes can be repeated.
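To make the "parsed but silently accepted" behavior concrete, here is a minimal sketch of a 64-bit-mode prefix scanner. The function name `scan_prefixes` and the table are my own, not from any real decoder, and it deliberately ignores VEX/EVEX/REX2. It assumes the commonly documented rule that a REX byte only counts when it immediately precedes the opcode, so a REX followed by a legacy prefix is dropped rather than rejected.

```python
# Legacy prefix bytes and the names used in this sketch (names are illustrative).
LEGACY_PREFIXES = {
    0xF0: "LOCK", 0xF2: "REPNE", 0xF3: "REP",
    0x26: "ES", 0x2E: "CS", 0x36: "SS", 0x3E: "DS", 0x64: "FS", 0x65: "GS",
    0x66: "OPSIZE", 0x67: "ADDRSIZE",
}

def scan_prefixes(code: bytes):
    """Split machine code into (legacy prefix names, effective REX or None, rest)."""
    legacy = []
    rex = None
    i = 0
    while i < len(code):
        b = code[i]
        if b in LEGACY_PREFIXES:
            legacy.append(LEGACY_PREFIXES[b])  # repeats and conflicts are kept, not rejected
            rex = None                         # assumption: an earlier REX is now ignored
        elif 0x40 <= b <= 0x4F:
            rex = b                            # only the last REX before the opcode counts
        else:
            break                              # first non-prefix byte: opcode starts here
        i += 1
    return legacy, rex, code[i:]

# Two segment overrides in a row are accepted without complaint.
print(scan_prefixes(bytes.fromhex("2e3e4890")))
# REX placed before REP: the REX is discarded, the REP still applies.
print(scan_prefixes(bytes.fromhex("48f3a5")))
```

Note that nothing in the loop can fail: every byte sequence yields some result, which is roughly the complaint.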
> The CPU also doesn't care if you use prefixes that aren't valid for a specific instruction, for example a REP on a non-repeatable instruction.
This is one of the reasons why the x86 could be extended so much. PAUSE is just REP NOP, for example. Segment prefixes in front of conditional branches were used as static branch prediction hints (which I believe have returned in some newer Intel CPUs). Useful if you want to make a hint on newer CPUs that is harmless on older CPUs.
Some prefixes have become part of the encoding for certain SIMD instructions, but that is a different case because those prefixes aren't hints.
The correct behavior for allowing future extensions was already introduced by Intel with the 80186 in 1982, which added an invalid instruction exception to be used for all undefined opcodes.
This behavior was unlike 8086/8088, which happily executed any undefined instructions, most of them being aliases to defined instructions.
For any opcode where current CPUs generate invalid instruction exceptions, it is very easy to define them in future CPUs to encode useful instructions. Had REP NOP generated exceptions in old CPUs, it would still have been fine for it to become PAUSE in current CPUs. Unfortunately, the designers of Intel CPUs have not always followed their own documentation, so not all invalid opcodes generate the exception, as they should. Because this was never enforced, there are even commercial programs that are invalid, and compilers that generate officially invalid instructions.
It is true that there are a few cases when Intel has exploited the fact that some encodings were equivalent with a NOP on old CPUs, by reusing them for some instruction on new CPUs, where this allowed the execution of a program compiled for new CPUs on old CPUs. However this has been possible only for very few instructions, e.g. for branch direction hints, when not executing them on old CPUs does not change the result of a program.
In general the reuse of an opcode for a new instruction, when that opcode does not generate exceptions on old CPUs, is very dangerous, because the execution on old CPUs of a program compiled for new CPUs will have unpredictable consequences, like destroying some property of the user.
Your example with PAUSE is also one of the very few examples, besides branch hints, where the execution of a new program on old computers is not dangerous, despite the reassignment of the opcode.
Some time ago there was a discussion about a bug in some CPU (I do not remember which one) that was triggered when the REP prefix and the 64-bit REX prefix appeared in an invalid order. Older CPUs ignored the invalid order instead of generating the appropriate exception, so invalid programs ran on them without any bad effects, but those same programs triggered the bug on that specific new CPU.
The new CPU should have been bug-free, but also the programs that triggered the bug should not have existed, as they should have crashed immediately on any older CPU.
There is utility for having a reserved set of opcode space for "NOP if you don't know what the semantics are, but later ISAs may attach semantics for it," because this allows you to add various instructions that merely do nothing on processors that don't support them. The ENDBR32/ENDBR64 instructions for CET, XACQUIRE/XRELEASE hints for LOCK, the MPX instructions, the PREFETCH instructions all use reserved NOP space (0F0D and 0F18-0F1F opcode space).
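As a rough illustration of the opcode ranges named above, this sketch checks whether a two-byte opcode falls in the 0F 0D / 0F 18-0F 1F space. The helper name `in_hint_nop_space` is made up for this example, and it only looks at the opcode bytes, not at mandatory prefixes or ModRM.

```python
def in_hint_nop_space(opcode: bytes) -> bool:
    """True if a two-byte opcode (0F xx) is 0F 0D or within 0F 18..0F 1F."""
    if len(opcode) != 2 or opcode[0] != 0x0F:
        return False
    return opcode[1] == 0x0D or 0x18 <= opcode[1] <= 0x1F

# ENDBR64 is F3 0F 1E FA, so its opcode bytes are 0F 1E.
print(in_hint_nop_space(bytes([0x0F, 0x1E])))
# The multi-byte NOP is 0F 1F /0.
print(in_hint_nop_space(bytes([0x0F, 0x1F])))
# CPUID (0F A2) is an ordinary opcode outside that space.
print(in_hint_nop_space(bytes([0x0F, 0xA2])))
```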
This is true, but the encoding space reserved for future extensions that is interpreted as NOP should be many times smaller than the space for encodings that generate the invalid instruction exception.
The reason is that very few useful instructions are purely performance hints or checks for exceptional conditions, i.e. instructions that can be ignored without bad consequences.
For the vast majority of instruction set extensions, not executing the new instructions completely changes the behavior of the program, which is not acceptable, so the execution of such programs must be prevented on older CPUs.
Regarding the order of prefixes, Intel has made mistakes in not specifying it initially in 8086 and in allowing redundant prefixes. The latter has been partially corrected in later CPUs by imposing a limit for the instruction length.
Because of this lack of specification, the various compilers and assemblers have generated any instruction formats that were accepted by an 8088, so it became impossible to tighten the specification.
However, what is really weird is that Intel and AMD have continued to accept incorrect instruction encodings even after later ISA extensions clearly specified only a certain encoding as valid. In reality the CPUs also accept other encodings, and now there are programs that use those alternative encodings that were supposed to be invalid.
The prefix structure has been enforced starting with the VEX prefixes (which is a lot later than it should have been; AMD made a mistake not enforcing more rules around the REX prefix). The legacy prefixes are of course an unfixable mess because of legacy.
Yes, I wish it were this simple. :) There are many other complications:
* Some instructions require VEX.L or VEX.W to be 0 or 1, and some encodings result in completely different instructions if you change VEX.L.
* Different bits of the EVEX prefix are valid depending on the opcode byte.
* Some encodings (called groups) produce different instructions depending on bits 3-5 of the modrm byte (the second byte after all prefixes). Some encodings further produce different groups depending on whether bits 6-7 (mod) of the modrm byte identifies a register or not.
* Some instructions read a whole vector register but only a scalar if the same instruction has a memory operand. Sometimes this is clear in the manual, sometimes it is not, sometimes the manual is downright wrong.
* Some instructions do not allow using the legacy high-8-bits registers even though they don't do anything with bits 8 and above of the operand: they only want a 32- or 64-bit register as their operand.
* APX (EVEX map 4) looks a lot like legacy map 0, but actually a few instructions were moved there from other maps for good reasons, a few more were moved there for no apparent reason (SHLD/SHRD iirc), and a few more are new.
* REX2 does not extend SSE and AVX instructions to 32 registers even though REX does extend them to 16.
* Intel defines a thing called VEX instruction classes, which makes sense except for a dozen or two instructions where it doesn't. For these, sometimes AMD uses a different class, sometimes doesn't; sometimes AMD's choice makes sense, sometimes it doesn't.
And many more that I found out while writing QEMU's current x86 decoder (which tries to be table based but sometimes that's just impossible).
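For anyone unfamiliar with the "groups" point in the list above, here is a small sketch of how the ModRM byte splits into fields and how bits 3-5 pick the instruction. It uses the well-known group 1 (opcodes 80/81/83) as the example; the function names are mine and this is nowhere near a real decoder.

```python
# Group 1 (opcodes 80/81/83): bits 3-5 of ModRM select the operation.
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]

def split_modrm(modrm: int):
    """Return the (mod, reg, rm) fields of a ModRM byte."""
    mod = (modrm >> 6) & 0b11    # bits 6-7: 0b11 means register operand
    reg = (modrm >> 3) & 0b111   # bits 3-5: register, or opcode extension in a group
    rm = modrm & 0b111           # bits 0-2: register or memory addressing form
    return mod, reg, rm

def decode_group1(modrm: int):
    """Return (mnemonic, operand kind) for a group 1 ModRM byte."""
    mod, reg, _rm = split_modrm(modrm)
    kind = "register" if mod == 0b11 else "memory"
    return GROUP1[reg], kind

print(decode_group1(0xC0))  # 83 C0 ib is add eax, imm8
print(decode_group1(0xE8))  # 83 E8 ib is sub eax, imm8
print(decode_group1(0x38))  # 83 38 ib is cmp dword [rax], imm8
```

The same opcode byte gives eight different instructions here, and for some other groups the register/memory distinction in `mod` changes the instruction as well.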
> Some instructions require VEX.L or VEX.W to be 0 or 1, and some encodings result in completely different instructions if you change VEX.L.
There is even an instruction where AMD got this wrong! VPERMQ requires VEX.W=1, but some AMD CPUs also happily execute it when VEX.W=0 even though that is supposed to raise an exception.
This is indeed a thing. I believe in general instructions are executed slower when there are more than 4 legacy prefixes. And there are plenty of other timing differences between different microarchitectures.
> However, in this case it doesn’t matter; those top bits are discarded when the result is written to the 32-bit eax.
Fun (but useless) fact: This being x86, of course there are at least three different ways [1] to encode this instruction: the way it was shown, with an address size override prefix (giving `lea eax, [edi+esi]`), or with both a REX prefix and an address size override prefix (giving `lea rax, [edi+esi]`).
And if you have a segment with base=0 around you can also add in a segment for fun: `lea rax, cs:[edi+esi]`
[1]: not counting redundant prefixes and different ModRMs
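A quick way to convince yourself that the three forms leave the same value in rax is to model the arithmetic. This is only a sketch of the semantics as I understand them (32-bit address size truncates the address computation, and a write to eax zero-extends into rax); the function names are made up, and the byte sequences in the comments are what I would expect an assembler to emit, not something taken from the article.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def lea_eax_rdi_rsi(rdi: int, rsi: int) -> int:
    """lea eax, [rdi+rsi] (8D 04 37): 64-bit address, 32-bit destination."""
    addr = (rdi + rsi) & MASK64
    return addr & MASK32          # write to eax zero-extends into rax

def lea_eax_edi_esi(rdi: int, rsi: int) -> int:
    """lea eax, [edi+esi] (67 8D 04 37): 32-bit address, 32-bit destination."""
    addr = ((rdi & MASK32) + (rsi & MASK32)) & MASK32
    return addr

def lea_rax_edi_esi(rdi: int, rsi: int) -> int:
    """lea rax, [edi+esi] (67 48 8D 04 37): 32-bit address zero-extended to 64 bits."""
    addr = ((rdi & MASK32) + (rsi & MASK32)) & MASK32
    return addr & MASK64

# Inputs with garbage in the upper halves still agree across all three forms.
rdi, rsi = 0xDEADBEEF_FFFFFFFF, 0x12345678_00000002
results = {f(rdi, rsi) for f in (lea_eax_rdi_rsi, lea_eax_edi_esi, lea_rax_edi_esi)}
print(results)
```

All three reduce to `(rdi + rsi) mod 2**32`, which is why the choice between them only affects encoding length.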
Evaluating how much of instruction space we cover was indeed difficult.
Initially, we wanted to parse Intel XED's datafiles to generate a map of valid instruction space, but we ended up going for the simpler approach of computing coverage by selecting instructions randomly and from real-world binaries because of time constraints.
From Table 7 you can get an idea of how many instruction variants we cover (~1500 covered, ~700 enumerated but not synthesized, 744 out of enumeration scope).
Instruction variants correspond much more closely with the mnemonics listed in the reference manuals, and this is typically the number reported by related work.
Yes, but I still think this falls victim to the problem I mentioned: you might have two dozen arithmetic instructions, and two that change privilege state. It is generally the latter that is more interesting to those doing this kind of analysis. (Not saying that the former is completely useless; I am sure emulator developers and similar would find it interesting. But most of the research effort going into finding new instructions or whatever is going towards the not-simple instructions.)
This may be a really dumb question, but is that much of the behavior of an x86_64 CPU variable and undefined? Until recently I thought the chipmakers provided full information (recently I found an article about people investigating the undocumented innards of the 286, IIRC). This seems like a pretty shaky foundation for software.
Documentation is definitely not one of x86's strengths. Other architectures do much better. For example, ARM provides formal models of their CPUs, and RISC-V is so simple you could implement all its semantics in a few thousand lines of code.
There are quite a few instructions with undefined behavior, but it is not that much of an issue if you can choose to avoid it -- for example in a compiler.
Almost all UB is found in flags or when using invalid instruction prefixes.
And although there is some unexpected UB, like `imul`'s zero flag being UB instead of being set according to the result of the multiplication [1], reading the manual and sticking to the parts that are clearly not UB gets you most of the way.
However, it becomes an issue if you need to analyze a binary that uses UB.
Then you can't choose which instructions to use, so you need to have a complete model of all UB.
That's much more difficult, and for example most decompilers currently fail at this.
We have an example of this in Figure 1 of our paper.