I can imagine a world where "good enough" GPGPUs become embedded in common chipsets the same way "good enough" regular GPUs are embedded now, but we're definitely not there yet. That said, it was only a few years between the Voodoo cards coming to market and Intel integrated graphics showing up.
We already have something similar in the form of HW accelerators for AI workloads in recent CPU designs, but that's not enough.
LLM inference workloads are bound by compute power, sure, but that's not insurmountable IMO. The much bigger challenge is memory. Not even the bandwidth, but the sheer amount of RAM you need just to load the LLM weights.
Specifically, even a single H100 with its 80 GB of memory will hardly suffice to host a mid-sized LLM such as llama3.1-70B. And an H100 is ~$50k.
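
Back-of-the-envelope, weights only (a rough Python sketch, assuming ~70B parameters and ignoring KV cache, activations, and framework overhead):

    # RAM needed just to hold the weights of a ~70B-parameter model
    # at common precisions (weights only, no KV cache or activations).
    PARAMS = 70e9  # ~70 billion parameters, e.g. llama3.1-70B

    BYTES_PER_PARAM = {
        "fp16/bf16": 2,
        "int8": 1,
        "int4": 0.5,
    }

    H100_MEMORY_GB = 80  # a single H100 has 80 GB of HBM

    for precision, nbytes in BYTES_PER_PARAM.items():
        gb = PARAMS * nbytes / 1e9
        fits = "fits" if gb <= H100_MEMORY_GB else "does NOT fit"
        print(f"{precision:>9}: ~{gb:,.0f} GB -> {fits} on one 80 GB H100")

At fp16 that's ~140 GB for the weights alone, so you're already at two H100s (or aggressive quantization) before you've served a single token.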
If that memory requirement is here to stay, and with the current transformer architecture it is, then the only option really left for affordable consumer HW is the smallest and least powerful LLMs. I can't imagine a built-in GPGPU with 80G of on-die memory. IMHO.