> In 5 years consumer chips and model inference will be so good you won't need a server for SOTA.
Naw man, you crazy. If you tell me that in 5 years consumer chips will be so good that I can run GPT-5.4-level AI on my phone, I'd find that plausible (I buy cheap phones). If you're telling me that in 5 years we won't need _servers_ because our _phones and/or desktops_ will be powerful enough to run the biggest, newest LLMs in existence, I question your judgment; that prediction shows a deep uncreativity about how massively compute-hungry SOTA models will get.
The valuable things to do with inference will keep being a server niche because they'll keep being 1-2 OOM more compute-hungry than whatever consumer hardware can handle. It's like gaming: my laptop can run games from 2015 at max settings no problem, but the games actually worth getting excited about in 2026 still melt a $2k GPU, because whatever headroom the hardware gains, developers immediately spend on ray tracing and Nanite and modelling individual skin cells or whatever. I don't see any plausible reason to expect the ceiling on "valuable server-side compute" or "inference capacity" to rise any more slowly than on-device capability is rising.
My assumption is that in 2031, SOTA top-intelligence AI will be hosted on cloud servers like it is today, offering dirt-cheap access to capabilities we can't even dream of today, while your Android will be running some open-source GPT-5+ equivalent.
The thing is, SOTA has a plateau. All LLMs work on the same principle: training data goes in, reinforced by human feedback. There is only so much input (all recorded human knowledge) and only so many human tweaks, and those can only squeeze so much signal-to-noise out of the output. The machine can't read your mind, and there is no single truthful answer to most questions, so there will always be a limit on how accurate or correct any response can get. At some point you just can't make a better response. The agent harness, prompts, etc. become the only remaining way to get better, and that's gonna be open source.
Add to that the algorithmic improvements in inference that are making it faster, with more context and higher quality. TurboQuant is just one example; more methods are coming out all the time. So inference is getting more efficient.
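For intuition, here's a toy sketch of the kind of weight quantization this family of methods builds on (TurboQuant itself is more sophisticated; the weight values below are made up for illustration):

```python
# Toy symmetric int8 quantization: store one float scale + int8 values
# instead of fp32 weights. ~4x less memory per weight, at the cost of
# a small rounding error. Real methods (TurboQuant etc.) go further.

def quantize_int8(weights):
    """Map floats into [-127, 127] integers using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.51, 0.33, 1.27, -1.09]   # made-up fp32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error is half a quantization step (~0.005 here).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)
```

Less memory per weight also means fewer bytes streamed from RAM per generated token, which is why quantization directly buys decode speed, not just smaller downloads.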
At the same time, hardware can kind of keep getting infinitely better. Even if you can't make it smaller, you can make it more energy efficient, improve multitasking, add more GPU cores/RAM or iGPUs, pack in more chips, improve cooling, use new materials... the sky's the limit.
Add all 3 together and at some point you will get Opus 4.7 on a phone at 40 t/s. At that point there's no way I'm paying for inference on a server. You can do RAG on-device, and image/video/voice is handled by multimodal models. I want my agent chats replicated, but that's Google Drive. I want the agent to search the web, but that's Google Search. So eventually we're back to just doing what we do today (pre-AI), only with more automation.
The really advanced shit will come in 10 years, when we finally crack real memory and learning. That will absolutely be locked up in the cloud. But that's not an LLM; it's something else entirely. (Slight caveat that WW3 would delay progress by 10-20 years.)
AFAICT this isn't how SOTA has worked, ever, since the term was invented. So far (again AFAICT) it's always been: centralized, highly-resourced nodes deliver more technically impressive results, while cheaper, lower-resource consumer hardware continually lags behind. Your premise that "SOTA has a plateau" needs data; you're giving me some juicy, plausible hypotheses about why advances might hit a wall, but technology tends to find ways around those walls. Do you disagree?
The history of computing is full of predictions that consumer hardware would catch up to server-class capability in X years, and the answer has consistently been that consumer hardware catches up to _yesterday's_ server capability while server capability moves on to new, more mind-blowing paradigms that won't be possible on consumer hardware for another half-decade or more.
I'm sure that specific scaling trajectories will hit specific ceilings, such that in specific ways one can argue that (for example) today's iPhone performs at parity with today's servers. In 5 minutes I can spin up the same Postgres or MongoDB that the largest companies on earth use server-side, though I can't support anywhere near the same data & traffic volume. But parity along specific technical axes is a very different matter from the broad prediction that "you won't need a server for SOTA".
To step back to the bigger context: your original point seems more along the lines of "we're obviously in an unsustainable bubble, and the rapid progress in on-device AI will further exacerbate the embarrassing collapse of all these overhyped AI companies". I strongly agree with you. But I think that's likely, _and_ I also firmly predict that the technical SOTA of 2031 (and 2041, if we make it there), in nearly every imaginable aspect including language-capable AI, will be vastly more capable than what you can run in your pocket.