I've been running this on my laptop with the Unsloth 20.9GB GGUF in LM Studio: h...

GistNoesis · 2026-04-16T22:24:37 1776378277

Thanks for pointing to the GGUF.

I just tried this GGUF with llama.cpp in its UD Q4_K_XL version on my custom agentic oritened task consisiting of wiki exploration and automatic database building ( https://github.com/GistNoesis/Shoggoth.db/ )

I noted a nice improvement over QWen3.5 in its ability to discover new creatures in the open ended searching task, but I've not quantified it yet with numbers. It also seems faster, at around 140 token/s compared to 100 token/s , but that's maybe due to some different configuration options.

Some little difference with QWen3.5 : to avoid crashes due to lack of memory in multimodal I had to pass --no-mmproj-offload to disable the gpu offload to convert the images to tokens otherwise it would crash for high resolutions images. I also used quantized kv store by passing -ctk q8_0 -ctv q8_0 and with a ctx-size 150000 it only need 23099 MiB of device memory which means no partial RAM offloading when I use a RTX 4090.

kelnos · 2026-04-16T20:23:47 1776371027

I'm not sure how you can give the flamingo win to Qwen:

* It's sitting on the tire, not the seat.

* Is that weird white and black thing supposed to be a beak? If so, it's sticking out of the side of its face rather than the center.

* The wheel spokes are bizarre.

* One of the flamingo's legs doesn't extend to the pedal.

* If you look closely at the sunglasses, they're semi-transparent, and the flamingo only has one eye! Or the other eye is just on a different part of its face, which means the sunglasses aren't positioned correctly. Or the other eye isn't.

* (subjective) The sunglasses and bowtie are cute, but you didn't ask for them, so I'd actually dock points for that.

* (subjective) I guess flamingos have multiple tail feathers, but it looks kinda odd as drawn.

In contrast, Opus's flamingo isn't as detailed or fancy, but more or less all of it looks correct.

withinboredom · 2026-04-16T21:50:27 1776376227

He literally said it came down to the comment in the SVG. Points for taste, not correctness. Basically.

realityfactchex · 2026-04-16T22:55:41 1776380141

Here's a reproduction attempt (LM Studio, same Qwen3.6-35B-A3B-GGUF model as linked in parent, M1 Max 64GB, <90 seconds):

https://files.catbox.moe/r3oru2.png

- My Qwen 3.6 result had sun and cloud in sky, similar to the second Opus 4.7 result in Simon's post.

- My Qwen 3.6 result had no grass (except as a green line), but all three results in Simon's post had grass (thick).

- My Qwen 3.6 result had visible "tailing air motion" like Simon's Qwen 3.6 result.

- My Qwen 3.6 result had a "sun with halo" effect that none of Simon's results had.

But, I know, it's more about the pelican and the bicycle.

_ache_ · 2026-04-17T05:58:37 1776405517

The bicycle frame is ok. Simon's was better but at least it's not broken like Opus 4.7.

I can't comment that flamingo.

jubilanti · 2026-04-16T18:16:42 1776363402

I wonder when pelican riding a bicycle will be useless as an evaluation task. The point was that it was something weird nobody had ever really thought about before, not in the benchmarks or even something a team would run internally. But now I'd bet internally this is one of the new Shirley Cards.

abustamam · 2026-04-16T18:41:07 1776364867

Simon has an article on this

https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

SwellJoe · 2026-04-16T22:03:22 1776377002

Pelicanmaxxing

amelius · 2026-04-16T19:59:01 1776369541

Yeah try it with something else, or e.g. add a tiger to the back seat.

survirtual · 2026-04-17T10:36:52 1776422212

I use this metric now, and I suggest you change it per your imagination:

"Make a single-page HTML file using threejs from a CDN. Render a scene of a flying dinosaur orbiting a planet. There are clouds with thunder and lightning, and the background is a beautiful starscape with twinkling stars and a colorful nebula"

This allows me to evaluate several factors across models. It is novel and creative. I generally run it multiple times, though now that I have shared it here, I will come up with new scenes personally to evaluate.

I also consider how well it one shots, errors generated, response to errors being corrected, and velocity of iteration to improvement.

Generally speaking, Claude Sonnet has done the best, Qwen3.5 122B does second, and I have nice results from Qwen3.5 35B.

ChatGPT does not do well. It can complete the task without errors but the creativity is atrocious.

MagicMoonlight · 2026-04-16T19:25:05 1776367505

They’ll hardcode it in 4.8, just like they do when they need to “fix” other issues

rafaelmn · 2026-04-16T18:41:34 1776364894

I mean look at the result where he asked about a unicycle - the model couldn't even keep the spokes inside the wheels - would be rudimentary if it "learned" what it means to draw a bicycle wheel and could transfer that to unicycle.

duzer65657 · 2026-04-16T19:27:19 1776367639

it's the frame that's surprisingly - and consistentnly - wrong. You'd think two triangles would be pretty easy to repro; once you get that the rest is easy. It's not like he's asking "draw a pelican on a four-bar linkage suspension mountainbike..."

Reddit_MLP2 · 2026-04-16T19:59:06 1776369546

This is older, but even humans don't have a great concept of how a bicycle works... https://twistedsifter.com/2016/04/artist-asks-people-to-draw...

yndoendo · 2026-04-16T20:31:23 1776371483

Wouldn't this be more about being capable of mentally remembering how a bicycle looks versus how it works?

This reminds me of Pictionary. [0] Some people are good and some are really bad.

I am really bad a remembering how items look in my head and fail at drawing in Pictionary. My drawing skills are tied to being able to copy what I see.

[0] https://en.wikipedia.org/wiki/Pictionary

johanvts · 2026-04-17T06:21:32 1776406892

I think it’s difficult to draw a bike exactly because you remember how it works rather than how it looks, so you worry about placing all the functional parts and get the overall composition wrong. Similar to drawing faces, without training, people will consistently dedicate too much area to the lower part of the face and draw some kind of neanderthal with no forehead.

quinnjh · 2026-04-16T21:38:46 1776375526

is it possible to have greater success with the specificity? I don't think i ever drew a bike frame properly as a kid despite riding them and understanding the concept of spokes and wheels...

hansmayer · 2026-04-17T11:43:35 1776426215

Valid points, but you"d think "superintelligence" would "know" how to draw a pelican on a bike?

bertili · 2026-04-16T17:53:58 1776362038

It's fascinating that a $999 Mac Mini (M4 32GB) with almost similar wattage as a human brain gets us this far.

johanvts · 2026-04-17T06:33:30 1776407610

Interesting thought, I looked it up out of curiosity and fund 155w max (but realistically more like 80w sustained) for the mac under load, and just around 20watts for the brain, surprisingly almost constant whether “under load” or not.

petu · 2026-04-17T10:27:22 1776421642

> 155w max (but realistically more like 80w sustained)

155W PSU seems to be unified with M4 Pro model, plus there's reserve for peripherals (~55W for 5 USB/Thunderbolt ports).

Apple lists 65W for base M4 Mac itself: https://support.apple.com/en-am/103253

Notebookcheck found same number: https://www.notebookcheck.net/Apple-Mac-Mini-M4-review-Small...

fragmede · 2026-04-17T16:34:31 1776443671

I clocked my M4 at 108 Watts while running inference using Qwen3.6-35b-a3b via Al dente.

culi · 2026-04-16T19:06:52 1776366412

the more I look at these images the more convinced I become that world models are the major missing piece and that these really are ultimately just stochastic sentence machines. Maybe Chomsky was right

bmitc · 2026-04-17T06:10:51 1776406251

> that these really are ultimately just stochastic sentence machines

I thought that's exactly what they are?

culi · 2026-04-17T19:44:04 1776455044

No, they have "attention". There is unique logic going on in the deep layers of the neural network.

Even the standard introductory exercise artificial neural networks, handwritten digit recognition, already shows deeper understanding. These simple networks take in raw pixels and somewhere in the many layers recognize "curves" and "edges" and then "circles" and "boxes" and whatnot and eventually "digits".

I think there's a genuine debate about whether or not this is a form of intelligence. I think the oversimplified argument of them just being stochastic sentence machines mostly comes from people who don't understand how they work. But I also think there's a much more nuanced version of this argument offered by people like Chomsky that should be taken seriously

bmitc · 2026-04-18T05:49:15 1776491355

> No, they have "attention". There is unique logic going on in the deep layers of the neural network.

Any specifics? That doesn't say anything about them not being sentence generators. And it's pretty well known that the LLMs constantly spew out fantastically grammatically correct sentences that have no logic to them whatsoever.

> These simple networks take in raw pixels and somewhere in the many layers recognize "curves" and "edges" and then "circles" and "boxes" and whatnot and eventually "digits".

That sounds like a version of anthropomorphizing. It is my understanding that it is a completely open problem as to what neural networks are actually doing in their internal, deep layers.

> I think the oversimplified argument of them just being stochastic sentence machines mostly comes from people who don't understand how they work.

I mean, that's effectively a logical fallacy, so it's not a strong argument.

mastermage · 2026-04-17T06:28:24 1776407304

I am so perplexed what exactly where people thinking they were. Its nothing else than highly sofisticated statistics.

tmountain · 2026-04-17T08:53:52 1776416032

From that perspective, which is totally correct, it makes you wonder what other domains of knowledge look like when pushed to the boundaries of our capabilities as a species.

mastermage · 2026-04-18T09:25:04 1776504304

That is a genuinely thought provoking idea.

culi · 2026-04-17T19:49:16 1776455356

Do you know of any other statistical model that can "hallucinate". They clearly have emergent capabilities that come from scale that are absent in any other statistical model we've ever dreamt up.

We know that LLMs build complex internal representations of language, logic, and concepts rather than just shallow word-counting.

If you deny that then you probably have an elementary understanding of how they work. Not even Chomsky denies that. The real argument imo is whether those internal representations constitute an actual "understanding" of the world or just flatten out to something much less interesting.

mastermage · 2026-04-18T09:29:50 1776504590

> Do you know of any other statistical model that can "hallucinate".

Actualy most statistical models can "hallucinate", specifically those that are capable of interpolation.

I have witnessed this for example in Gaussian Processes. In my own scientific work.

060880 · 2026-04-17T15:47:40 1776440860

The Chomsky argument feels like it's moving in a different direction than what's actually useful to know. Whether or not these models have "real" understanding, they're clearly capable of solving problems that were previously considered to require understanding. The more interesting question is whether world models, if they existed, would actually improve the failure modes people care about — like hallucination and planning — or whether we'd just get better stochastic sentence machines with an extra layer of abstraction on top.

cyclopeanutopia · 2026-04-16T17:59:58 1776362398

But that you also gave a win to Qwen on flamingo is pretty outrageous! :)

Tthe right one looks much better, plus adding sunglasses without prompting is not that great. Hopefully it won't add some backdoor to the generated code without asking. ;)

simonw · 2026-04-16T18:13:44 1776363224

I love how the Chinese models often have an unprompted predilection to add flair.

GLM-5.1 added a sparkling earring to a north Virginia opossum the other day and I was delighted: https://simonwillison.net/2026/Apr/7/glm-51/

monksy · 2026-04-16T20:57:45 1776373065

You're running 5.1 locally or hosted?

simonw · 2026-04-16T21:37:43 1776375463

I used that one via OpenRouter.

prirun · 2026-04-16T18:41:12 1776364872

The flamingo on Qwen's unicycle is sitting on the tire, not the seat. That wins because of sunglasses?

evilduck · 2026-04-16T19:38:22 1776368302

Can a benchmark meant as a joke not use a fun interpretation of results? The Qwen result has far better style points. Fun sunglasses, a shadow, a better ground, a better sky, clouds, flowers, etc.

If we want to get nitty gritty about the details of a joke, a flamingo probably couldn't physically sit on a unicycle's seat and also reach the pedals anyways.

akavel · 2026-04-16T20:37:07 1776371827

Well, maybe the flamingo is a really good unicyclist...

https://youtu.be/Rrpgd5oIKwI

yabutlivnWoods · 2026-04-17T05:37:36 1776404256

Transparency of the wheel

Stylized gradients on the flamingo

Flowers

Ground/grass has a stylized look and feel

...despite a miss along the Y-axis where it's below the seat, couple oddly organized tail feathers, spokes, the composition overall is much closer to a production quality entity

Opus 4.7 looks like 20 seconds in MS paint.

Qwen3.6 looks incomplete due to the sitting position, but like a WIP I could see on a designer coworkers screen if I walk up and interrupt them. Click and drag it up, adjust tail feathers and spokes, you're there or much closer, to a usable output

rdslw · 2026-04-16T19:11:44 1776366704

interesting, I just tried this very model, unsloth, Q8, so in theory more capable than Simon's Q4, and get those three "pelicans". definitely NOT opus quality. lmstudio, via Simon's llm, but not apple/mlx. Of course the same short prompt.

Simon, any ideas?

https://ibb.co/gFvwzf7M

https://ibb.co/dYHRC3y

https://ibb.co/FLc6kggm (tried here temperature 0.7 instead of pure defaults)

strobe · 2026-04-17T01:04:00 1776387840

try Unsloth recommended settings

    Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

    Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

    Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

    Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

(Please note that the support for sampling parameters varies according to inference frameworks.)

monksy · 2026-04-16T20:59:55 1776373195

Hey I really enjoy your blog. On some things I end up finding a blog post of yours thats a year+ old and at other times, you and I are investigating similar things. I just pulled Qwen3.6 - 35b -A3B (Can't believe thats a A3B coming from 35b).

I'm impressed about the reach of your blog, and I'm hoping to get into blogging similar things. I currently have a lot on my backlog to blog about.

In short, keep up the good work with an interesting blog!

jamwise · 2026-04-16T17:41:56 1776361316

I've had some really gnarly SVGs from Claude. Here's what I got after many iterations trying to draw a hand: https://imgur.com/a/X4Jqius

giantg2 · 2026-04-16T18:50:38 1776365438

Probably because all the training material of humans drawing hands are garbage haha.

jaspanglia · 2026-04-16T22:25:56 1776378356

The real question is what the next truly weird, un-optimized prompt will be. Something involving a sloth debugging a quantum computer in MS Paint?"

quietsegfault · 2026-04-16T23:37:32 1776382652

The qwen flamingo looks like it’s smoking’ a doobie.

MeteorMarc · 2026-04-16T19:11:19 1776366679

Interesting, qwen has the pelican driving on the left lane. Coincidence or has it something to do with the workers providing the RL data?

rubiquity · 2026-04-16T19:24:19 1776367459

Could be on a bike path where bikes are on the left and pedestrians to the right.

Scrounger · 2026-04-17T06:43:36 1776408216

I've been running qwen3.6:35b-a3b-q4_K_M (22.3GB) via Ollama.

Is the 20.9GB GGUF version better or negligible in comparison?

bwv848 · 2026-04-16T21:00:29 1776373229

I've been trying the Q4_K_M version, and sometimes it gets stuck in a loop. Gemma 4 doesn’t have this issue.

yencabulator · 2026-04-16T21:33:04 1776375184

This has happened before with quantizations and other backends (ones not used by the research lab). Give it a week, download latest versions of everything, and try again.

mobiuscog · 2026-04-17T12:22:31 1776428551

I'm having the same issues, the more I use it. The repetition penalty doesn't seem to help.

I get some really amusing 'reflective' responses, but I think it needs a bit more cooking. Maybe I'll try another variant.

Readerium · 2026-04-17T01:10:27 1776388227

perhaps increasing repitition_penalty might be helpful

danielhanchen · 2026-04-16T17:50:28 1776361828

Oh that is pretty good! And the SVG one!

logicallee · 2026-04-17T11:18:19 1776424699

what kind of specs does your laptop have? do you know how many tokens/second you get on it?

slekker · 2026-04-16T17:48:38 1776361718

How does it do with the "car wash" benchmark? :D