I really like BAML, but this post seems a little too much like a BAML funnel. Here are three methods that have worked for me consistently since constrained sampling first came out:
1. Add a validation step (using a mini model) right at the beginning - sub-second response times; the validation will either emit True/False or emit a function call
2. Use a sequence of (1) large model without structured outputs for reasoning/parsing, chained to (2) small model for constrained sampling/structured output
3. Keep your Pydantic models/schemas flat (not too nested, without too many enumerations) and "help" the model in the system prompt as much as you can (rough sketch of what I mean below)
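For concreteness, here's a minimal sketch of point 3: a flat schema plus an explicit system prompt. I'm assuming the OpenAI Python SDK's structured-output `parse` helper; the model name and fields are purely illustrative, not a recommendation:

    from openai import OpenAI
    from pydantic import BaseModel

    # Flat schema: no nesting, no long enums, so constrained sampling has an easy job
    class Invoice(BaseModel):
        vendor: str
        total: float
        currency: str   # plain string instead of a long Literal/enum
        due_date: str   # ISO date kept as a string; validate downstream

    invoice_text = "ACME Corp invoice: total 1200.00 USD, due 2025-01-31"

    client = OpenAI()
    resp = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # stand-in for whichever small model you use
        messages=[
            # "Help" the model: spell out every field in the system prompt
            {"role": "system", "content": "Extract the invoice fields. total is "
                "a number, currency is a 3-letter code, due_date is YYYY-MM-DD."},
            {"role": "user", "content": invoice_text},
        ],
        response_format=Invoice,
    )
    invoice = resp.choices[0].message.parsed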
I agree with you (I have reviewed papers in the past); however, made-up citations are a "signal". Why would the authors do that? If they made a citation up, most likely they haven't really read that prior work. If they haven't, have they really done proper due diligence on their research? Are they just trying to "beef up" their paper with citations to unfairly build credibility?
These are fantastic insights! I work in the legaltech space, so something to keep in mind is that the legal space is very sensitive to data storage and security (apart from this, of course: https://alexschapiro.com/security/vulnerability/2025/12/02/f...). So models hosted in e.g. Azure, or on-prem deployments, are more common. I have friends in the health space, and it's a similar story there. Finance (banking especially) is the same. Hence those categories look more or less constant over time and have the smallest contributions in this study.
I think there is this troika of "Leadership", "Management" and "Followership". You don't have to be an engineering manager to be a leader, and just because you are a leader doesn't mean you have any "followers". As someone who's been a team lead, a tech lead, an EM, and a C-level, I feel the goal is to hit a balance between those three. You want to embody a leader by actually being technically great, visionary, empathetic, and leading by example; but you also want to manage people and expectations; and ultimately you want people to follow you, to basically say "I love working for/with this person". Finding this triangulation is essentially what makes you timeless and relevant no matter the fad.
I couldn't immediately see in their graphs/tables any comparison against simple lexical/statistical context compression, such as candidate selection of chunks using TF-IDF, word overlap, etc. For most of us in industry, these are the quick wins we need: equivalent performance to sending a huge amount of information to the LLM, while compressing by 10x (rough sketch of such a baseline below).
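To make the baseline concrete, here's a rough sketch with scikit-learn (the chunking and the top-k budget are my assumptions, not something from the paper):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def select_chunks(query: str, chunks: list[str], keep: int = 5) -> list[str]:
        # Fit TF-IDF over the query plus all candidate chunks
        vec = TfidfVectorizer().fit([query] + chunks)
        scores = cosine_similarity(vec.transform([query]), vec.transform(chunks))[0]
        # Keep the top-k chunks by similarity, in their original document order
        top = sorted(scores.argsort()[::-1][:keep])
        return [chunks[i] for i in top]

For a 10x compression you'd set keep to roughly len(chunks) / 10 and send only the joined survivors to the LLM.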
One of the most interesting mathematical aspects to me is that LLMs are logit emitters, and associated with that output is uncertainty. A lot of people talk about networks of agents, but what you are doing is accumulating uncertainty: every model in the chain introduces its own uncertainty on top of what it inherits. In some situations I've seen a complete collapse after 3 LLM calls chained together (toy illustration below). Hence a lot of people recommend "human in the loop" as much as possible to try to reduce that uncertainty (shift the posterior, if you will); or they recommend more of a workflow approach, where you have a single orchestrator that decides which function to call, and most of the emphasis (and context engineering) is placed on that orchestrator. But it all ties together in the maths of LLMs.
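As a back-of-the-envelope illustration of that collapse (the per-call reliability numbers are made up, not measurements): if each call in a chain is independently "right" with probability p, end-to-end reliability decays as p^n:

    # Toy model: per-call reliability p, n chained calls => p**n end to end
    for p in (0.99, 0.95, 0.90):
        for n in (1, 3, 5):
            print(f"p={p:.2f}, n={n}: chain reliability ~ {p**n:.3f}")
    # Three chained calls at 90% each are already down to ~0.73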
When you use ChatGPT and it executes code, e.g. when you tell it to do something with a CSV file, it seems to run in a VM with certain tools and libraries available to it and sandboxed disk access, though no internet access. So it's kind of already there.
`to=python.exec` is how it runs Python code, and `to=container.exec` is how it runs bash commands, with attached files showing up in `/mnt/data`. Unfortunately the stdout is heavily truncated before being shown to the model, so it's not a hack for longer context via printing a file attachment's contents.
Now imagine you run two AIs (like ChatGPT) on your machine or on a server, and you maybe even want them to cooperate on something. How do you do that? Right, you can't: there is no standard, no interoperability, nothing.