LLMs in silicon are the future. It won't be long until you can just plug an LLM chip into your computer and talk to it at 100x the speed of current LLMs. Capability will be lower, but the speed will make up for it.
You can always delegate sub-agents to cloud-based infrastructure for things that need more intelligence. But the future is to keep the core interaction loop on the local device, always ready for your input.
A lot of what we ask of these models isn't all that hard. Summarize this, parse that, call this tool, look that up, etc. 99.999% of it really isn't about implementing complex algorithms, solving important math problems, or working your way through a benchmark of leetcode-style programming exercises. You also really don't need these models to know everything. It's nice if they can hallucinate a decent answer to most questions, but the smarter way is to look up the right answer and then summarize it. Good enough goes a long way. Speed and latency are becoming a key selling point. You need enough capability locally to know when to escalate to something slower and more costly.
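That escalation logic is simple enough to sketch. Here's a toy version of the loop, with made-up stand-ins (`local_model`, `cloud_model`, `CONFIDENCE_THRESHOLD`) rather than any real API; the point is just that the local model handles the fast path and only punts when it isn't confident:

```python
# Toy sketch of a local-first loop with cloud escalation.
# All names here are hypothetical; no real model API is assumed.

CONFIDENCE_THRESHOLD = 0.8

def local_model(prompt):
    # Stand-in for a small on-device model: returns (answer, confidence).
    # Here it's "confident" on short prompts just to make the demo work.
    confidence = 0.9 if len(prompt) < 200 else 0.4
    return ("summary of: " + prompt, confidence)

def cloud_model(prompt):
    # Stand-in for a slower, more capable hosted model.
    return "detailed answer for: " + prompt

def answer(prompt):
    text, confidence = local_model(prompt)
    if confidence >= CONFIDENCE_THRESHOLD:
        return text               # fast path: stays on device
    return cloud_model(prompt)    # escalation: slower and more costly

print(answer("parse this log line"))
```

In practice the interesting part is the confidence signal itself (logprobs, a router model, explicit tool-use rules), but the shape of the loop is the same.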
This will drive an overdue increase in the memory size of phones and laptops. Laptops especially have been stuck at the same common base level of 8-16GB for about 15 years now. Apple still sells laptops with just 8GB. I had a 16GB MacBook Pro in 2012, and at the time that wasn't even that special. My current one has 48GB; enough for some of the nicer models. You can get as much as 256GB today.
> This will drive an overdue increase in memory size of phones and laptops.
DRAM costs are still skyrocketing, so no, I don't think so. It's more likely that we'll bring back wear-resistant persistent memory as formerly seen with Intel Optane.
Standard pork cycle in economics. Production capacity eventually goes up to meet demand and prices come down again. RAM has been going through cycles like this for decades. Every time it happens, people seem to have no memory whatsoever of the previous cycles. Just wait a few years for it to become cheap again.
I think there are drastic differences between computer vision models and LLMs that you’re not considering. LLMs are huge relative to vision models, and require gobs of fast memory. For this reason a little USB dongle isn’t going to cut it.
Put another way, there already exist add-in boards like this, and they’re called GPUs.
Sure, but that’s somewhat orthogonal to the point I was making, which is that LLMs are huge in size. Even in the case of a custom “LLM chip,” you’ll need huge amounts of very fast storage of some sort (likely DRAM), which places constraints on the size, power consumption, and cost of such a device. This device, if it existed, would not in any way resemble the Coral TPU product that the GP was referencing; I think in fact it would be closer in size, price, and form factor to a GPU.
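The back-of-envelope math makes the constraint concrete. Assuming (my numbers, purely illustrative) a 70B-parameter model at 4-bit quantization, and that generating each token reads the full set of weights once, a given tokens-per-second target directly implies a memory bandwidth target:

```python
# Rough sizing for a hypothetical "LLM chip". All figures are assumptions
# for illustration, not vendor specs.

params_b = 70          # model size in billions of parameters (assumed)
bytes_per_param = 0.5  # 4-bit quantized weights
tokens_per_sec = 100   # an assumed speed target

weights_gb = params_b * bytes_per_param        # DRAM needed just for weights
bandwidth_gb_s = weights_gb * tokens_per_sec   # weights read once per token

print(f"{weights_gb:.0f} GB of weights, ~{bandwidth_gb_s:.0f} GB/s of bandwidth")
```

That works out to tens of GB of DRAM and thousands of GB/s of bandwidth, which is exactly the size, power, and cost envelope of a GPU, not a USB dongle.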