I really like BAML, but this post seems a little too much like a BAML funnel. Here are three methods that have worked for me consistently since constrained sampling first came out:
1. Add a validation step (using a mini model) right at the beginning - sub-second response times; the validation will either emit True/False or emit a function call
2. Use a sequence of (1) large model without structured outputs for reasoning/parsing, chained to (2) small model for constrained sampling/structured output
3. Keep your Pydantic models/schemas flat (not too nested, without too many enumerations) and "help" the model in the system prompt as much as you can (rough sketch of what I mean below)
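For concreteness, here's a minimal sketch of point 3: a flat schema plus an explicit system prompt. I'm assuming the OpenAI Python SDK's structured-output `parse` helper; the model name and fields are purely illustrative, not a recommendation:

    from openai import OpenAI
    from pydantic import BaseModel

    # Flat schema: no nesting, no long enums, so constrained sampling has an easy job
    class Invoice(BaseModel):
        vendor: str
        total: float
        currency: str   # plain string instead of a long Literal/enum
        due_date: str   # ISO date kept as a string; validate downstream

    invoice_text = "ACME Corp invoice: total 1200.00 USD, due 2025-01-31"

    client = OpenAI()
    resp = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # stand-in for whichever small model you use
        messages=[
            # "Help" the model: spell out every field in the system prompt
            {"role": "system", "content": "Extract the invoice fields. total is "
                "a number, currency is a 3-letter code, due_date is YYYY-MM-DD."},
            {"role": "user", "content": invoice_text},
        ],
        response_format=Invoice,
    )
    invoice = resp.choices[0].message.parsed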
I agree with you (I have reviewed papers in the past); however, made-up citations are a "signal". Why would the authors do that? If they made a citation up, most likely they haven't really read that prior work. If they haven't, have they really done proper due diligence on their research? Are they just trying to "beef up" their paper with citations to unfairly build credibility?
These are fantastic insights! I work in the legaltech space, so something to keep in mind is that the legal space is very sensitive to data storage and security (apart from this, of course: https://alexschapiro.com/security/vulnerability/2025/12/02/f...). So models hosted in e.g. Azure, or on-prem deployments, are more common. I have friends in the health space, and it's a similar story there. Finance (banking especially) is the same. Hence those categories look more or less constant over time and have the smallest contributions in this study.
I think there is this troika of "Leadership", "Management" and "Followership". You don't have to be an engineering manager to be a leader, and just because you are a leader doesn't mean you have any "followers". As someone who's been a team lead, a tech lead, an EM, and a C-level, I feel the goal is to hit a balance between those three. You want to embody a leader by actually being technically great, visionary, empathetic, and leading by example; but you also want to manage people and expectations; and ultimately you want people to follow you, to basically say "I love working for/with this person". Finding this triangulation is essentially what makes you timeless and relevant no matter the fad.
I couldn't immediately see in their graphs/tables any comparison against simple lexical/statistical context compression, such as candidate selection of chunks using TF-IDF, word overlap, etc. For most of us in industry, these are the quick wins we need: equivalent performance to sending a huge amount of information to the LLM, while compressing by 10x (rough sketch of such a baseline below).
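To make the baseline concrete, here's a rough sketch with scikit-learn (the chunking and the top-k budget are my assumptions, not something from the paper):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def select_chunks(query: str, chunks: list[str], keep: int = 5) -> list[str]:
        # Fit TF-IDF over the query plus all candidate chunks
        vec = TfidfVectorizer().fit([query] + chunks)
        scores = cosine_similarity(vec.transform([query]), vec.transform(chunks))[0]
        # Keep the top-k chunks by similarity, in their original document order
        top = sorted(scores.argsort()[::-1][:keep])
        return [chunks[i] for i in top]

For a 10x compression you'd set keep to roughly len(chunks) / 10 and send only the joined survivors to the LLM.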
One of the most interesting mathematical aspects to me is that LLMs are logit emitters, and associated with that output is uncertainty. A lot of people talk about networks of agents, but what you are doing is accumulating uncertainty: every model in the chain introduces its own uncertainty on top of what it inherits. In some situations I've seen a complete collapse after 3 LLM calls chained together (toy illustration below). Hence a lot of people recommend "human in the loop" as much as possible to try to reduce that uncertainty (shift the posterior, if you will); or they recommend more of a workflow approach, where you have a single orchestrator that decides which function to call, and most of the emphasis (and context engineering) is placed on that orchestrator. But it all ties together in the maths of LLMs.
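As a back-of-the-envelope illustration of that collapse (the per-call reliability numbers are made up, not measurements): if each call in a chain is independently "right" with probability p, end-to-end reliability decays as p^n:

    # Toy model: per-call reliability p, n chained calls => p**n end to end
    for p in (0.99, 0.95, 0.90):
        for n in (1, 3, 5):
            print(f"p={p:.2f}, n={n}: chain reliability ~ {p**n:.3f}")
    # Three chained calls at 90% each are already down to ~0.73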
When you use ChatGPT and it executes code, e.g. when you tell it to do something with a CSV file, it seems to run in a VM with certain tools and libraries available to it and sandboxed disk access, though no internet access. So it's kind of already there.
`to=python.exec` is how it runs Python code, and `to=container.exec` is how it runs bash commands, with attached files showing up in `/mnt/data`. Unfortunately the stdout is heavily truncated before being shown to the model, so it's not a hack for longer context via printing a file attachment's contents.
Now imagine you run two AIs (like ChatGPT) on your machine or on a server, and you maybe even want them to cooperate on something. How do you do that? Right, you can't: there is no standard, no interoperability, nothing.