
Note to others reading along: in its final appendix, the OP paper reports DFloat11 reduces tokens/sec by ~2-3x for Llama-3.1-8B, Qwen-2.5-14B/32B, and Mistral-Small-24B (the throughput penalty is not reported for the other models).

With DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to the CPU.

Classic comp sci tradeoff between space and speed, no free lunch, etc.
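
For anyone curious why lossless compression of BF16 weights is possible at all: the exponent bits of trained weights are highly skewed, so entropy coding (Huffman, per the paper) shrinks them well below their stored 8 bits. A toy sketch of that intuition, not the paper's code, assuming numpy and normally-distributed weights:

    import numpy as np

    # Toy model of why DFloat11-style lossless compression works:
    # BF16 exponents are heavily skewed, so entropy coding stores
    # them in far fewer than the 8 bits they occupy on disk.
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, 1_000_000).astype(np.float32)

    # Exponent field: bits 23-30 of fp32 (bfloat16 shares this layout).
    exps = ((w.view(np.uint32) >> 23) & 0xFF).astype(np.int64)

    # Shannon entropy of the exponent distribution, bits per weight.
    counts = np.bincount(exps, minlength=256)
    p = counts[counts > 0] / counts.sum()
    h = float(-(p * np.log2(p)).sum())

    print(f"exponent entropy: {h:.2f} bits (stored as 8 in BF16)")
    print(f"~compressed width: {1 + h + 7:.1f} bits (sign+exp+mantissa)")

You get roughly 11 bits per weight instead of 16, hence the name. The catch is that the weights have to be decompressed on the fly on the GPU during inference, which is presumably where the throughput hit comes from.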


