
Note to others reading along: in its final appendix, the OP paper reports DFloat11 reduces tokens/sec by ~2-3x for Llama-3.1-8B, Qwen-2.5-14B/32B, and Mistral-Small-24B (the throughput penalty is not reported for the other models).

With DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to the CPU.

Classic comp sci tradeoff between space and speed, no free lunch, etc.
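
For anyone curious why lossless compression of BF16 weights is possible at all: the exponent bits of trained weights are highly skewed, so entropy coding (Huffman, per the paper) shrinks them well below their stored 8 bits. A toy sketch of that intuition, not the paper's code, assuming numpy and normally-distributed weights:

    import numpy as np

    # Toy model of why DFloat11-style lossless compression works:
    # BF16 exponents are heavily skewed, so entropy coding stores
    # them in far fewer than the 8 bits they occupy on disk.
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, 1_000_000).astype(np.float32)

    # Exponent field: bits 23-30 of fp32 (bfloat16 shares this layout).
    exps = ((w.view(np.uint32) >> 23) & 0xFF).astype(np.int64)

    # Shannon entropy of the exponent distribution, bits per weight.
    counts = np.bincount(exps, minlength=256)
    p = counts[counts > 0] / counts.sum()
    h = float(-(p * np.log2(p)).sum())

    print(f"exponent entropy: {h:.2f} bits (stored as 8 in BF16)")
    print(f"~compressed width: {1 + h + 7:.1f} bits (sign+exp+mantissa)")

You get roughly 11 bits per weight instead of 16, hence the name. The catch is that the weights have to be decompressed on the fly on the GPU during inference, which is presumably where the throughput hit comes from.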


