Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respectiv...

kpw94 · 2026-04-02T17:05:17 1775149517

Wild differences in ELO compared to tfa's graph: https://storage.googleapis.com/gdm-deepmind-com-prod-public/...

(Comparing Q3.5-27B to G4 26B A4B and G4 31B specifically)

I'd assume Q3.5-35B-A3B would performe worse than the Q3.5 deep 27B model, but the cards you pasted above, somehow show that for ELO and TAU2 it's the other way around...

Very impressed by unsloth's team releasing the GGUF so quickly, if that's like the qwen 3.5, I'll wait a few more days in case they make a major update.

Overall great news if it's at parity or slightly better than Qwen 3.5 open weights, hope to see both of these evolve in the sub-32GB-RAM space. Disappointed in Mistral/Ministral being so far behind these US & Chinese models

culi · 2026-04-02T18:54:24 1775156064

You're conflating lmarena ELO scores.

Qwen actually has a higher ELO there. The top Pareto frontier open models are:

  model                        |elo  |price
  qwen3.5-397b-a17b            |1449 |$1.85
  glm-4.7                      |1443 | 1.41
  deepseek-v3.2-exp-thinking   |1425 | 0.38
  deepseek-v3.2                |1424 | 0.35
  mimo-v2-flash (non-thinking) |1393 | 0.24
  gemma-3-27b-it               |1365 | 0.14
  gemma-3-12b-it               |1341 | 0.11
  gpt-oss-20b                  |1318 | 0.09
  gemma-3n-e4b-it              |1318 | 0.03

https://arena.ai/leaderboard/text?viewBy=plot

What Gemma seems to have done is dominate the extreme cheap end of the market. Which IMO is probably the most important and overlooked segment

coder543 · 2026-04-02T20:30:28 1775161828

That Pareto plot doesn't seem include the Gemma 4 models anywhere (not just not at the frontier), likely because pricing wasn't available when the chart was generated. At least, I can't find the Gemma 4 models there. So, not particularly relevant until it is updated for the models released today.

coder543 · 2026-04-07T03:28:10 1775532490

Gemma 4 31B has now wiped out several of those models from the pareto frontier, now that it has pricing. Gemma 4 26B A4B has an Elo, but no pricing, so it still isn't on that chart. The Gemma 4 E2B/E4B models still aren't on the arena at all, but I expect them to move the pareto frontier as well if they're ever added, based on how well they've performed in general.

coder543 · 2026-04-02T17:19:56 1775150396

> Wild differences in ELO compared to tfa's graph

Because those are two different, completely independent Elos... the one you linked is for LMArena, not Codeforces.

nateb2022 · 2026-04-02T17:31:00 1775151060

> Very impressed by unsloth's team releasing the GGUF so quickly, if that's like the qwen 3.5, I'll wait a few more days in case they make a major update.

Same here. I can't wait until mlx-community releases MLX optimized versions of these models as well, but happily running the GGUFs in the meantime!

Edit: And looks like some of them are up!

FullyFunctional · 2026-04-03T03:44:09 1775187849

absolute n00b here is very confused about the many variations; it looks like the Mac optimized MX versions aren’t available in Ollama yet (I mostly use claude code with this)

gigatexal · 2026-04-02T18:15:20 1775153720

the benchmarks showing the "old" Chinese qwen models performing basically on par with this fancy new release kinda has me thinking the google models are DOA no? what am I missing?

bachmeier · 2026-04-02T18:05:23 1775153123

So is there something I can take from that table if I have a 24 GB video card? I'm honestly not sure how to use those numbers.

GistNoesis · 2026-04-02T18:25:15 1775154315

I just tried with llama.cpp RTX4090 (24GB) GGUF unsloth quant UD_Q4_K_XL You can probably run them all. G4 31B runs at ~5tok/s , G4 26B A4B runs at ~150 tok/s.

You can run Q3.5-35B-A3B at ~100 tok/s.

I tried G4 26B A4B as a drop-in replacement of Q3.5-35B-A3B for some custom agents and G4 doesn't respect the prompt rules at all. (I added <|think|> in the system prompt as described (but have not spend time checking if the reasoning was effectively on). I'll need to investigate further but it doesn't seem promising.

I also tried G4 26B A4B with images in the webui, and it works quite well.

I have not yet tried the smaller models with audio.

kpw94 · 2026-04-02T20:17:27 1775161047

> I'll need to investigate further but it doesn't seem promising.

That's what I meant by "waiting a few days for updates" in my other comment. Qwen 3.5 release, I remember a lot of complaints about: "tool calling isn't working properly" etc.

That was fixed shortly after: there was some template parsing work in llama.cpp. and unsloth pulled out some models and brought back better one for improving something else I can't quite remember, better done Quantization or something...

coder543 pointed out the same is happening regarding tool calling with gemma4: https://news.ycombinator.com/item?id=47619261

GistNoesis · 2026-04-02T20:34:36 1775162076

The model does call tools successfully giving sensible parameters but it seems to not picking the right ones in the right order.

I'll try in a few days. It's great to be able to test it already a few hours after the release. It's the bleeding edge as I had to pull the last from main. And with all the supply chain issues happening everywhere, bleeding edge is always more risky from a security point of view.

There is always also the possibility to fine-tune the model later to make sure it can complete the custom task correctly. But the code for doing some Lora for gemma4 is probably not yet available. The 50% extra speed seems really tempting.

amarshall · 2026-04-02T20:25:39 1775161539

If you are running on 4090 and get 5 t/s, then you exceeded your VRAM and are offloading to the CPU (or there is some other serious perf. issue)

mrinterweb · 2026-04-03T00:28:46 1775176126

Thank you. I have the same card, and I noticed the same ~100 TPS when I ran Q3.5-35B-A3B. G4 26B A4B running at 150TPS is a 50% performance gain. That's pretty huge.

refulgentis · 2026-04-02T19:15:19 1775157319

Reversing the X and Y axis, adding in a few other random models, and dropping all the small Qwens makes this worse than useless as a Qwen 3.5 comparison, it’s actively misleading. If you’re using AI, please don’t rush to copy paste output :/

EDIT: Lordy, the small models are a shadow of Qwen's smalls. See https://huggingface.co/Qwen/Qwen3.5-4B versus https://www.reddit.com/r/LocalLLaMA/comments/1salgre/gemma_4...

scrlk · 2026-04-02T19:44:10 1775159050

I transposed the table so that it's readable on mobile devices.

I should have mentioned that the Qwen 3.5 benchmarks were from the Qwen3.5-122B-A10B model card (which includes GPT-5-mini and GPT-OSS-120B); apologies for not including the smaller Qwen 3.5 models.

refulgentis · 2026-04-02T19:50:36 1775159436

It’s not readable on a phone either. Text wraps. unless you’re testing on foldable?

BloondAndDoom · 2026-04-02T22:38:56 1775169536

Small qwen models are magical

refulgentis · 2026-04-02T23:46:56 1775173616

It's so so good.

I have an app I've been working on for 2.5 years and felt kinda stupid making sure llama.cpp worked everywhere, including Android and iOS.

The 0.8B beats every <= 7B model I've used on tool use and can do RAG. Like you could ship it to someone who didn't know AI and it can do all the basics and leave UX intact.