> Beyond solid benchmarks, Alibaba's power move was dropping a bunch of models available to use and run locally today.

I agree, the advantage of the Qwen3 family is the plethora of sizes and architectures to choose from. Another is the ease of fine-tuning for downstream tasks.

On the other hand, I'd say it's "in spite" of their benchmarks, because there's obviously something wrong with either the published results or the way they measure them. Early impressions do not support those benchmarks at all. At one point they even had a 4b model scoring better than their prev-gen 72b model, which was pretty solid on its own. Take benchmarks with a huge boulder of salt.

Something is messing with recent benchmarks. I don't know exactly what, but I have a feeling that distilling + RL + something else in their pipelines is making benchmark data creep into the models, either through reward hacking or through other signals getting leaked (i.e. prev-gen models optimised for one benchmark are "distilling" those signals into newer, smaller models). No, a 4b model is absolutely not gonna be better than 4o/sonnet3.7, whatever the benchmarks say.