> In 5 years consumer chips and model inference will be so good you won't need a server for SOTA.
Naw man, you crazy. If you tell me that in 5 years consumer chips will be so good that I can run GPT-5.4-level AI on my phone, I'd find that plausible (I buy cheap phones). If you're telling me that in 5 years we won't need _servers_ because our _phones and/or desktops_ will be powerful enough to run the biggest, newest LLMs in existence, I question your judgment; that prediction shows a deep uncreativity about how massively compute-hungry SOTA models will get.
The valuable things to do with inference will keep being a server niche because they'll keep being 1-2 OOM more compute-hungry than whatever consumer hardware can handle. It's like gaming: my laptop can run games from 2015 at max settings no problem, but the games actually worth getting excited about in 2026 still melt a $2k GPU, because whatever headroom the hardware gains, developers immediately spend on ray tracing and Nanite and modelling individual skin cells or whatever. I don't see any plausible reason to expect the ceiling on "valuable server-side compute" or "inference capacity" to rise any more slowly than on-device capability is rising.
My assumption is that in 2031, SOTA top-intelligence AI will be hosted on cloud servers like it is today, offering dirt-cheap access to capabilities we can't even dream of today, while your Android will be running some open-source GPT-5+ equivalent.
The thing is, SOTA has a plateau. All LLMs work on the same principle: training data goes in, reinforced by human feedback. There is only so much input (all recorded human knowledge) and only so many human tweaks, and those can only squeeze so much signal-to-noise out of the output. The machine can't read your mind, and there is no single truthful answer to most questions, so there will always be a limit on how accurate or correct any response can get. At some point you just can't make a better response. The agent harness, prompts, etc. become the only remaining way to get better, and that's gonna be open source.
Add to that the algorithmic improvements in inference that are making it faster, with more context and higher quality. TurboQuant is just one example; more methods are coming out all the time. So inference is getting more efficient.
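For intuition, here's a toy sketch of the kind of weight quantization this family of methods builds on (TurboQuant itself is more sophisticated; the weight values below are made up for illustration):

```python
# Toy symmetric int8 quantization: store one float scale + int8 values
# instead of fp32 weights. ~4x less memory per weight, at the cost of
# a small rounding error. Real methods (TurboQuant etc.) go further.

def quantize_int8(weights):
    """Map floats into [-127, 127] integers using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.51, 0.33, 1.27, -1.09]   # made-up fp32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error is half a quantization step (~0.005 here).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)
```

Less memory per weight also means fewer bytes streamed from RAM per generated token, which is why quantization directly buys decode speed, not just smaller downloads.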
At the same time, hardware can kind of keep getting infinitely better. Even if you can't make it smaller, you can make it more energy efficient, improve multitasking, add more GPU cores/RAM or iGPUs, pack in more chips, improve cooling, use new materials... the sky's the limit.
Add all 3 together and at some point you will get Opus 4.7 on a phone at 40 t/s. At that point there's no way I'm paying for inference on a server. You can do RAG on-device, and image/video/voice is handled by multimodal models. I want my agent chats replicated, but that's Google Drive. I want the agent to search the web, but that's Google Search. So eventually we're back to just doing what we do today (pre-AI), only with more automation.
The really advanced shit will come in 10 years, when we finally crack real memory and learning. That will absolutely be locked up in the cloud. But that's not an LLM; it's something else entirely. (Slight caveat that WW3 would delay progress by 10-20 years.)
AFAICT this isn't how SOTA has worked, ever, since the term was invented. So far (again AFAICT) it's always been: centralized, highly-resourced nodes deliver more technically impressive results, while cheaper, lower-resource consumer hardware continually lags behind. Your premise that "SOTA has a plateau" needs data; you're giving me some juicy, plausible hypotheses about why advances might hit a wall, but technology tends to find ways around those walls. Do you disagree?
The history of computing is full of predictions that consumer hardware would catch up to server-class capability in X years, and the answer has consistently been that consumer hardware catches up to _yesterday's_ server capability while server capability moves on to new, more mind-blowing paradigms that won't be possible on consumer hardware for another half-decade or more.
I'm sure that specific scaling trajectories will hit specific ceilings, such that in specific ways one can argue that (for example) today's iPhone performs at parity with today's servers. In 5 minutes I can spin up the same Postgres or MongoDB that the largest companies on earth use server-side, though I can't support anywhere near the same data & traffic volume. But parity along specific technical axes is a very different matter from the broad prediction that "you won't need a server for SOTA".
To step back to the bigger context: your original point seems more along the lines of "we're obviously in an unsustainable bubble, and the rapid progress in on-device AI will further exacerbate the embarrassing collapse of all these overhyped AI companies". I strongly agree with you. But I think that's likely, _and_ I also firmly predict that the technical SOTA of 2031 (and 2041, if we make it there), in nearly every imaginable aspect including language-capable AI, will be vastly more capable than what you can run in your pocket.