dpe82's comments | Hacker News

Sonnet 4.5 was a pretty significant improvement over Opus 4.

Yes, but it's easier to understand the difference between Sonnet 4.5 and Opus, and apply that difference to Opus 4.6.

It's wild that Sonnet 4.6 is roughly as capable as Opus 4.5 - at least according to Anthropic's benchmarks. It will be interesting to see if that's the case in real, practical, everyday use. The speed at which this stuff is improving is really remarkable; it feels like the breakneck pace of compute performance improvements of the 1990s.

The most exciting part isn't necessarily the ceiling rising, though that's happening, but the floor rising while costs plummet. Getting Opus-level reasoning at Sonnet prices/latency is what actually unlocks agentic workflows. We are effectively getting the same intelligence unit for half the compute every 6-9 months.
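
Back-of-the-envelope, taking the 6-9 month halving claim above at face value (it's a claim, not a measured figure), the compounding looks like this:

    # Cost per "intelligence unit" if it halves every 6-9 months.
    for months in (12, 24, 36):
        fast = 0.5 ** (months / 6)  # halving every 6 months
        slow = 0.5 ** (months / 9)  # halving every 9 months
        print(f"{months} mo: cost falls to {fast:.0%}-{slow:.0%} of today's")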

2024: Intelligence too cheap to meter

2026: Everyone is spending $500/month on LLM subscriptions


My Dad used to make the same joke in the 1980s about how they'd told him in the 1950s that nuclear power would be "too cheap to meter", which I assume is where the trope originated.

> We are effectively getting the same intelligence unit for half the compute every 6-9 months.

Something something ... Altman's law? Amodei's law?

Needs a name.


How about More's law - because we keep getting "more" compute at a lower cost?

This is what excited me about Sonnet 4.6. I've been running Opus 4.6, and switched over to Sonnet 4.6 today to see if I could notice a difference. So far, I can't detect much if any difference, but it doesn't hit my usage quota as hard.

Moore's law lives on!

> The speed at which this stuff is improving is really remarkable; it feels like the breakneck pace of compute performance improvements of the 1990s.

Yeah, but RAM prices are also back to 1990s levels.



You wouldn't download a RAM


We don't rent RAMs!

I knew I'd been keeping all my old RAM sticks for a reason!

simonw hasn't shown up yet, so here's my "Generate an SVG of a pelican riding a bicycle"

https://claude.ai/public/artifacts/67c13d9a-3d63-4598-88d0-5...


We finally have AI safety solved! Look at that helmet

"Look ma, no wings!"

:D


For comparison, I think the current leader in pelican drawing is Gemini 3 Deep Think:

https://bsky.app/profile/simonwillison.net/post/3meolxx5s722...


My take (also Gemini 3 Deep Think): https://gemini.google.com/share/12e672dd39b7

Somehow it's much better now.


I'm not familiar with Gemini; isn't this just diffusion model output? The pelican test is for the LLM to produce SVG markup.

Yeah, I was so amazed by the result I didn't even realize Gemini used Nano Banana while producing the result.

The point of the penny-farthing is that you drive the front wheel directly with the pedals, but this seems to have the pedals in a spot where they would drive a chain, although there is no chain?

Is that actually better? That pelican has arms sprouting out of its wings

if they want to prove the model's performance the bike clearly needs aero bars

Can’t beat Gemini’s which was basically perfect.

> Sonnet 4.6 is roughly as capable as Opus 4.5 - at least according to Anthropic's benchmarks

Yeah, it's really not. Sonnet still struggles where Opus, even 4.5, succeeds (and some examples show Opus 4.6 is actually worse than 4.5, all while being more expensive and taking longer to finish).


We see the same with Google's Flash models. It's easier to make a small capable model when you have a large model to start from.

Flash models are nowhere near Pro models in daily use: much higher hallucination rates, and it's easy to get into a death spiral of failed tool uses and never come out.

You should always take those claims that smaller models are as capable as larger models with a grain of salt.


Flash model n is generally a slightly better Pro model (n-1), in other words you get to use the previously premium model as a cheaper/faster version. That has value.

They do have value, because they are much much cheaper.

But no, 3.0 Flash is not as good as 2.5 Pro. I use both of them extensively, especially for translation. 3.0 Flash will confidently mistranslate certain things that 2.5 Pro will not.


Totally fair. Translation is one of those specific domains where model size correlates directly with quality, and no amount of architectural efficiency can fully replace parameter count.

The system card even says that Sonnet 4.6 is better than Opus 4.6 in some cases: Office tasks and financial analysis.

Given that users preferred it to Sonnet 4.5 "only" 70% of the time (according to their blog post), I highly doubt that this is representative of real-life usage. Benchmarks are just completely meaningless.

For cases where 4.5 already met the bar, I would expect a 50% preference each way, which makes the 70% figure hard to interpret without a lot more detail. E.g. if a fraction x of prompts are effectively coin flips because both models meet the bar, and the rest always favor 4.6, then 0.5x + (1 - x) = 0.7 gives x = 60%.

Good point. So much functionality gets commoditized, we have to move goalposts more or less constantly.

I sent Opus a satellite photo of NYC at night and it described "blue skies and cliffs/shoreline"... Mistral did it better. Specific use case, but yeah. OpenAI was just like "you can't submit a photo by URL". I was going to try Gemini but it kept bringing up Vertex AI. This was with LangChain.

I just sent Opus a NYC night satellite view and it described it just as expected. Seems like you have a tooling problem, not a model problem.

Would be curious about your setup; this was mine:

    satellite_imagery_analysis_agent = create_agent(
        model="claude-opus-4-6",
        system_prompt="your task is to analyze satellite images",
    )

    response = satellite_imagery_analysis_agent.invoke({
        "messages": [
            {
                "role": "user",
                "content": "What do you see in this satellite image? https://images.unsplash.com/photo-1446776899648-aa78eefe8ed0...",
            }
        ]
    })

With this output:

# Satellite Image Analysis

I can see this image shows an *aerial/satellite view of a coastline*. Here are the key features I can identify:

## Geographic Features

- *Ocean/Sea*: A large body of deep blue water dominates a significant portion of the image
- *Coastline*: A clearly defined boundary between land and water with what appears to be a rugged or natural shoreline
- *Beach/Shore*: Light-colored sandy or rocky coastal areas visible along the water's edge

## Terrain

- *Varied topography*: The land area shows a mix of greens and browns, suggesting:
  - Vegetated areas (green patches)
  - Arid or bare terrain (brown/tan areas)
- *Possible cliffs or elevated terrain* along portions of the coast

## Atmospheric Conditions

- *Cloud cover*: There appear to be some clouds or haze in parts of the image
- Generally clear conditions allowing good visibility of surface features

## Notable Observations

- The color contrast between the *turquoise/shallow nearshore waters* and the *deeper blue offshore waters* suggests varying ocean depths (bathymetry)
- The coastline geometry suggests this could be a *peninsula, island, or prominent headland*
- The landscape appears relatively *semi-arid* based on the vegetation patterns

---

Note: Without precise geolocation metadata, I'm providing a general analysis based on visible features. The image appears to capture a scenic coastal region, possibly in a Mediterranean, subtropical, or tropical climate zone.

Would you like me to focus on any specific aspect of this image?
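
For anyone hitting this failure mode: as far as I know, Claude doesn't fetch URLs embedded in plain text, so the model only "saw" the URL string, not the pixels. A minimal sketch of the fix, using the raw anthropic SDK instead of LangChain (the model name is copied from upthread and the image URL is a placeholder, so treat both as assumptions):

    import base64

    import httpx
    from anthropic import Anthropic

    # Placeholder URL; substitute the actual image to analyze.
    IMAGE_URL = "https://example.com/nyc-night-satellite.jpg"

    image_b64 = base64.standard_b64encode(httpx.get(IMAGE_URL).content).decode()

    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-opus-4-6",  # model name as used upthread
        max_tokens=1024,
        system="your task is to analyze satellite images",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64,
                    },
                },
                {"type": "text", "text": "What do you see in this satellite image?"},
            ],
        }],
    )
    print(response.content[0].text)

If I remember right, LangChain passes provider-native content blocks through, so the same image block should also work inside the create_agent setup in place of the plain string.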


Why is it wild that an LLM is as capable as a previously released LLM?

Opus is supposed to be the expensive-but-quality one, while Sonnet is the cheaper one.

So if you don't want to pay the significant premium for Opus, it seems like you can just wait a few weeks till Sonnet catches up


Strangely enough, my first test with Sonnet 4.6 via the API for a relatively simple request was more expensive ($0.11) than my average request to Opus 4.6 (~$0.07), because it used way more tokens than what I would consider necessary for the prompt.

This is an interesting trend with recent models. The smarter ones get away with far fewer thinking tokens, partially or fully negating the speed/price advantage of the smaller models.
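
To make the mechanism concrete, here's the arithmetic with invented numbers (not Anthropic's actual rates), chosen to roughly match the $0.11 vs ~$0.07 figures above:

    # A lower per-token price can still mean a pricier request
    # if the model burns more thinking/output tokens.
    def request_cost(output_tokens: int, usd_per_mtok: float) -> float:
        return output_tokens * usd_per_mtok / 1_000_000

    print(request_cost(22_000, 5.0))   # verbose "small" model: $0.11
    print(request_cost(3_500, 20.0))   # terse "large" model:   $0.07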

Just like humans :-)

E.g. a smart person will automate a task instead of executing it repeatedly.


Okay, thanks. Hard to keep all these names apart.

I'm even surprised people pay more money for some models than others.


Because Opus 4.5 was released like a month ago and was state of the art, and now a significantly faster and cheaper version is already comparable.

"Faster" is also a good point. I'm using different models via GitHub copilot and find the better, more accurate models way to slow.

Opus 4.5 was November, but your point stands.

Fair. Feels like a month!

It means the price has dropped to a third in a few months.

Because Opus 4.5 inference is/was more expensive.

It's a perfectly reasonable choice: flexible, well specified, well supported, reasonably performant. I think the extreme level of hype 20 years ago was overdone and (just like with anything) there's good ways to adopt it and bad ways. But as a basic technology choice, it's fine. Particularly these days when you can have a coding agent write the parser boilerplate, etc. for you.

> It's a perfectly reasonable choice: flexible, well specified, well supported, reasonably performant. I think the extreme level of hype 20 years ago was overdone and (just like with anything) there's good ways to adopt it and bad ways. But as a basic technology choice, it's fine.

Absolutely with you up to here, but...

> Particularly these days when you can have a coding agent write the parser boilerplate, etc. for you.

Absolutely not. Having seen the infinite ways naive XML implementations go wrong (arguably one of the main causes of death for XHTML, since browsers rightfully rejected bad XML), "Don't roll your own XML implementation" should be right up there with "Don't roll your own crypto".

I don't feel like it's going out on a limb to say that if someone needs to defer to an LLM to implement XML, they're not qualified to determine whether it's done right and/or catch what it got enthusiastically wrong.


Oh sorry, I don't at all intend to say you should write your own parser! Totally agree: "Don't roll your own XML implementation"

What I was addressing is that interfacing with an XML parser and converting its output into whatever your internal representation is can be a chore. LLMs are great at that kind of stuff.
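
To be concrete about the kind of boilerplate I mean, here's a minimal sketch (the element names and the Book type are made up for illustration; the actual XML parsing stays in the stdlib's ElementTree):

    import xml.etree.ElementTree as ET
    from dataclasses import dataclass

    @dataclass
    class Book:
        title: str
        author: str
        year: int

    def parse_books(xml_text: str) -> list[Book]:
        """Map parsed XML elements onto our internal representation."""
        root = ET.fromstring(xml_text)
        return [
            Book(
                title=el.findtext("title", default=""),
                author=el.findtext("author", default=""),
                year=int(el.findtext("year", default="0")),
            )
            for el in root.iter("book")
        ]

    print(parse_books("""
    <library>
      <book><title>SICP</title><author>Abelson</author><year>1985</year></book>
    </library>
    """))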


I'm going to stand by my position there: if you're writing an application whose primary technical purpose is to communicate via an XML-based protocol and you feel the need to outsource the XML part to the bullshit machine, IMO you probably shouldn't be writing that application.

To each their own.

Sir Humphry Davy first isolated the stuff and he called it aluminum, so that's good enough for me.

Well, the name Davy originally proposed was alumium.

I propose we switch to that instead, so everyone can be annoyed equally and in the same way.


I accept your proposal; alumium it is.


Thanks, but the audio clips don't work here either.


Lots and lots of red LEDs. Such an iconic machine! I miss computers that look good.

BTW, IBM has been doing a fine design job with their quantum computers - they aren’t quite the revolution we were promised, but they do look the part.


There was a sockets API though (https://en.wikipedia.org/wiki/Winsock), and IIRC we all used Trumpet Winsock on Windows 3.1 with our dialup connections. But it could have been 3.11; my memory is a bit hazy.

3.11 was so much nicer than 3.1 (and 3.0) I can’t imagine not moving to it as soon as possible.

Windows for Workgroups 3.11 did not contain Cardfile. :-(

Didn’t it have a proper address book? I remember I could send faxes through Mail.

> Didn’t it have a proper address book?

Schedule+, which shipped with Windows for Workgroups 3.11, had address book functionality that was clearly better than Cardfile's.

But people used Cardfile for many purposes other than serving as an address book.


I was like 11 at the time buying computer stuff with lawn care earnings so I used whatever I could get my hands on. :)

There's also a corollary to this: if the organization does not recognize some work as needed or useful, you could well be actively wasting your time putting effort into it. There might be a good reason the company doesn't care that you just don't see, and leadership could be (at best) confused about why you would spend time on it.

Given enough soft skills, you can persuade your boss that what you are doing is important, and help him/her represent the department as uncovering and proactively addressing an important issue. Ideally it should align well with the agenda of your boss's boss.

For sure, but sometimes what you or I think should be important really isn't in the grand scheme of things. An example could be focusing on cost or efficiency, generally very reasonable things to care about, but if all a company cares about right now is growth at all costs, then that focus would be wrong. This can happen: the company leadership might see a market that they absolutely must enter and be dominant in no matter the cost. That may not filter down well through 3-4 layers of management, so the soft skill in that instance is sussing out what the several layers of management above you actually care about and surfacing things that align with those concerns.

That's a very difficult metric to measure, whereas "did this user return and continue paying" is easy. The tyranny of metrics in action.

Man, I hate metrics sometimes. Important things that are hard to measure just get left by the wayside.

ChromeOS is a really great option for "just want to read emails and browse the web".

Oh yeah, at least with ChromeOS, Chrome isn't installing itself like spyware alongside any other software installer.
