I recall a Qwen exec posted a public poll on Twitter, asking which model from Qw...

zozbot234 · 2026-04-16T14:38:37 1776350317

The 27B model is dense. Releasing a dense model first would be terrible marketing, whereas 35A3B is a lot smarter and more quick-witted by comparison!

arxell · 2026-04-16T14:51:59 1776351119

Each has its pros and cons. Dense models of equivalent total size obviously do run slower if all else is equal; however, the fact is that 35A3B is absolutely not 'a lot smarter'... in fact, if you set aside the slower inference rates, Qwen3.5 27B is arguably more intelligent and reliable. I use both regularly on a Strix Halo system... Just see the comparison table here: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF . The problem that you have to acknowledge if running locally (especially for coding tasks) is that your primary bottleneck quickly becomes prompt processing (NOT token generation), and here the differences between dense and MoE are variable and usually negligible.
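
To make the prompt-processing point concrete, here is a back-of-the-envelope sketch. The token counts and throughput figures are made-up placeholders, not measurements from a Strix Halo or any other machine; substitute your own numbers.

```python
def latency_split(prompt_tokens: int, new_tokens: int,
                  prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) for a single request."""
    prefill_s = prompt_tokens / prefill_tps   # time spent processing the prompt
    decode_s = new_tokens / decode_tps        # time spent generating the reply
    return prefill_s, decode_s


# Hypothetical coding-agent turn: a large context goes in, a short patch comes out.
prefill_s, decode_s = latency_split(prompt_tokens=50_000, new_tokens=500,
                                    prefill_tps=400.0, decode_tps=40.0)
print(f"prefill {prefill_s:.1f}s, decode {decode_s:.1f}s")
```

With inputs shaped like these, the wait is dominated by reading the prompt, even though the prefill rate is ten times the decode rate.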

Mikealcl · 2026-04-16T16:25:13 1776356713

Could you explain why prompt processing is the bottleneck, please? I've seen this behavior but I don't understand why.

zozbot234 · 2026-04-16T16:38:58 1776357538

You should be able to save a lot on prefill by stashing KV-cache shared prefixes (since KV-cache for plain transformers is an append-only structure) to near-line bulk storage and fetching them in as needed. Not sure why local AI engines don't do this already since it's a natural extension of session save/restore and what's usually called prompt caching.
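
A toy sketch of the shared-prefix idea, assuming a block-aligned cache keyed by a hash of the token prefix. `PrefixKVStore` and `tokens_to_prefill` are hypothetical names, the "blob" is a stand-in for real KV tensors, and this is not how any particular engine implements it.

```python
import hashlib
from typing import Dict, List


class PrefixKVStore:
    """Toy store mapping token prefixes to stand-in KV-cache blobs."""

    def __init__(self, block: int = 4) -> None:
        self.block = block                      # cache granularity, in tokens
        self.blobs: Dict[str, List[int]] = {}   # key -> placeholder for saved KV

    @staticmethod
    def _key(tokens: List[int]) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def save(self, tokens: List[int]) -> None:
        # Append-only KV means every block-aligned prefix is reusable on its own.
        for end in range(self.block, len(tokens) + 1, self.block):
            prefix = tokens[:end]
            self.blobs[self._key(prefix)] = list(prefix)

    def longest_prefix(self, tokens: List[int]) -> int:
        """Number of leading tokens whose KV is already stored."""
        best = 0
        for end in range(self.block, len(tokens) + 1, self.block):
            if self._key(tokens[:end]) not in self.blobs:
                break
            best = end
        return best


def tokens_to_prefill(store: PrefixKVStore, tokens: List[int]) -> int:
    """Return how many tokens still need prefill, then store the new prefixes."""
    reused = store.longest_prefix(tokens)
    store.save(tokens)
    return len(tokens) - reused


store = PrefixKVStore(block=4)
first = tokens_to_prefill(store, list(range(10)))               # cold cache
second = tokens_to_prefill(store, list(range(8)) + [99, 100, 101])
print(first, second)
```

The second request shares its first eight tokens with the first one, so only the three-token tail needs prefill in this toy model.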

FuckButtons · 2026-04-16T22:56:14 1776380174

If I understand you correctly, this is essentially what vLLM does with its paged cache. If I've misunderstood, I apologize.

zozbot234 · 2026-04-16T23:08:47 1776380927

Paged Attention is more of a low-level building block, aimed initially at avoiding duplication of shared KV-cache prefixes in large-batch inference. But you're right that it's quite related. The llama.cpp folks are still thinking about it, per a recent discussion from that project: https://github.com/ggml-org/llama.cpp/discussions/21961

nunodonato · 2026-04-16T16:39:38 1776357578

I was hoping this would be the model to replace our Qwen3.5-27B, but the difference is marginal. Too risky, I'll pass and wait for the release of a dense version.

JKCalhoun · 2026-04-16T16:23:01 1776356581

"…whereas 35A3B is a lot smarter…"

Must. Parse. Is this a 35 billion parameter model that needs only 3 billion parameters to be active? (Trying to keep up with this stuff.)

EDIT: A later comment seems to clarify:

"It's a MoE model and the A3B stands for 3 Billion active parameters…"

halJordan · 2026-04-16T18:47:55 1776365275

That makes no sense. If you were just going to release the "more hype-able because it's quicker" model, then why have a poll?

Miraste · 2026-04-16T14:51:00 1776351060

What? 35B-A3B is not nearly as smart as 27B.

stratos123 · 2026-04-16T20:50:10 1776372610

One interesting thing about Qwen3 is that looking at the benchmarks, the 35B-A3B models seem to be only a bit worse than the dense 27B ones. This is very different from Gemma 4, where the 26B-A4B model is much worse on several benchmarks (e.g. Codeforces, HLE) than 31B.

zozbot234 · 2026-04-16T21:19:02 1776374342

> This is very different from Gemma 4, where the 26B-A4B model is much worse on several benchmarks (e.g. Codeforces, HLE) than 31B.

Wouldn't you totally expect that, since 26A4B is lower on both total and active params? The more sensible comparison would pit Qwen 27B against Gemma 31B and Gemma 26A4B against Qwen 35A3B.

Hugsun · 2026-04-17T11:49:29 1776426569

They're comparing Qwen's moe vs dense (smaller difference) against Gemma's moe vs dense (bigger difference). Your proposed alternative misses the point.

zozbot234 · 2026-04-17T12:04:09 1776427449

Gemma's dense is bigger than its moe's total parameters. You could totally expect the moe to do terribly by comparison.

ekianjo · 2026-04-16T14:56:41 1776351401

yeah the 27B feels like something completely different. If you use it on long context tasks it performs WAY better than 35b-a3b

Der_Einzige · 2026-04-16T15:27:58 1776353278

I've been telling analysts/investors for a long time that dense architectures aren't "worse" than sparse MoEs and to continue to anticipate the see-saw of releases on those two sub-architectures. Glad to continuously be vindicated on this one.

For those who don't believe me: go take a look at the logprobs of a MoE model and a dense model and let me know if you can notice anything. Researchers sure did.

reissbaker · 2026-04-17T10:36:36 1776422196

Dense is (much) worse in terms of training budget. At inference time, dense is somewhat more intelligent per bit of VRAM, but much slower, so for a given compute budget it's still usually worse in terms of intelligence-per-dollar even ignoring training cost. If you're willing to spend more you're typically better off training and running a larger sparse model rather than training and running a dense one.

Dense is nice for local model users because they only need to serve a single user and VRAM is expensive. For the people training and serving the models, though, dense is really tough to justify. You'll see small dense models released to capitalize on marketing hype from local model fans but that's about it. No one will ever train another big dense model: Llama 3.1 405B was the last of its kind.
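
A rough sketch of the trade-off described above: weight memory scales with total parameters, while per-token compute scales with active parameters. It uses the common rule of thumb of about 2 FLOPs per active parameter per token, ignores attention and runtime overhead, and assumes 4-bit weights; `per_token_cost` is a hypothetical helper.

```python
def per_token_cost(total_params: int, active_params: int,
                   bits_per_weight: int = 4) -> tuple[int, int]:
    """Return (weight_bytes, approx_flops_per_token) for a model."""
    weight_bytes = total_params * bits_per_weight // 8   # memory follows total size
    flops = 2 * active_params                            # compute follows active size
    return weight_bytes, flops


dense = per_token_cost(total_params=27_000_000_000, active_params=27_000_000_000)
moe = per_token_cost(total_params=35_000_000_000, active_params=3_000_000_000)
print("dense 27B  :", dense)
print("MoE 35B-A3B:", moe)
```

Under these assumptions the MoE needs more memory for weights but far less compute per generated token, which is the shape of the argument in this subthread.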

Der_Einzige · 2026-04-17T14:42:23 1776436943

You want to take bets on this? I'm willing to bet 500USD that an open access dense model of at least 300B is released by some lab within 3 years.

naasking · 2026-04-17T03:55:45 1776398145

MoE isn't inherently better, but I do think it's still an underexplored space. When your sparse model can do 5 runs on the same prompt in the time a dense model takes to generate one, it opens up all sorts of interesting possibilities.
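
One such possibility, sketched as a simple majority vote over several samples. `sample` is a hypothetical callable standing in for whatever inference call you use, and the canned sampler below only exists to make the example runnable.

```python
from collections import Counter
from typing import Callable


def majority_vote(sample: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Run the sampler n times on the same prompt and return the most common answer."""
    answers = [sample(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


# Stand-in sampler that replays canned answers instead of calling a model.
canned = iter(["4", "5", "4", "4", "3"])
result = majority_vote(lambda prompt: next(canned), "What is 2 + 2?", n=5)
print(result)
```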

zkmon · 2026-04-16T14:55:35 1776351335

throwdbaaway · 2026-04-16T22:58:20 1776380300

Based on the release schedule of 3.5 previously, my optimistic take is that they distill the small models from the 397B, and it is much faster to distill a sparse A3B model. Hopefully the other variants will be released in the coming days.

arunkant · 2026-04-16T14:46:58 1776350818

Probably coming next

zkmon · 2026-04-16T14:51:58 1776351118

I'm guessing 3.5-27b would beat 3.6-35b. MoE is a bad idea, because for the same VRAM 27b would leave a lot more room for context, and the quality of work directly depends on context size, not just the "B" number.
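
A rough sketch of that VRAM budget. Everything here is an assumption for illustration: a 24 GB card, 4-bit weights, and a hypothetical attention layout (48 layers, 8 KV heads, head dimension 128, fp16 cache, full attention in every layer) applied to both models so that only the weight size differs. It is not the actual configuration of either Qwen model.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV-cache bytes for one token: a K and a V vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def context_tokens_that_fit(vram_bytes: int, total_params: int,
                            bits_per_weight: int, kv_per_token: int) -> int:
    """Tokens of KV cache that fit in whatever VRAM the weights leave free."""
    weight_bytes = total_params * bits_per_weight // 8
    free_bytes = max(vram_bytes - weight_bytes, 0)
    return free_bytes // kv_per_token


kv = kv_bytes_per_token(layers=48, kv_heads=8, head_dim=128)   # hypothetical layout
vram = 24_000_000_000                                          # hypothetical 24 GB card
fit_27b = context_tokens_that_fit(vram, 27_000_000_000, 4, kv)
fit_35b = context_tokens_that_fit(vram, 35_000_000_000, 4, kv)
print(fit_27b, fit_35b)
```

Under these assumptions the smaller weight footprint leaves noticeably more room for context, which is the point being made; offloading or a different attention design would change the numbers.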

zozbot234 · 2026-04-16T14:59:47 1776351587

MoE is not a bad idea for local inference if you have fast storage to offload to, and this is quickly becoming feasible with PCIe 5.0 interconnect.

perbu · 2026-04-16T16:50:53 1776358253

MoE is excellent for unified-memory inference hardware like DGX Spark, Mac Studio, etc. The large memory size means you can have quite a few B's, and the smaller experts keep those tokens flowing fast.