It's still baffling to me that the various API providers don't let us upload our own custom grammars. It would enable so many use cases (HTML generation, for example) at essentially no cost on their part.
There are some implementation concerns, but the real answer is that it is an ideological choice.
The AI companies believe that these kinds of grammar mistakes will be solved by improving the models. To build out tools for grammar constrained inference like this is to suggest, on some level, that GPT-N+1 won't magically solve the problem.
The deeper level is that it's not just simple grammar constraints. Constraining to JSON is a nice party trick, but it opens the door to further ideas. How about constraining to a programming language's grammar? Those are well defined: you just swap the JSON grammar file for the Java grammar file, job done.
We can go further: why not use a language server to constrain not only the grammar but also the content? We know which variables and functions are in scope, so constraining a variable reference or function call to one of their names can be done with the same technique as grammar constraints ("monitor-guided decoding", figured out back in 2023).
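For a flavour of the idea, here's a crude sketch, assuming a HuggingFace-style tokenizer and a list of in-scope names handed back by a language server. The function and argument names are illustrative, not the actual monitor-guided decoding code, and a real implementation would precompute a trie over the vocabulary rather than scan it every step:

```python
# Sketch: which next tokens keep the identifier being generated consistent
# with something the language server says is actually in scope?
# `tokenizer` is assumed to expose `vocab_size` and `decode`, as in transformers.
def allowed_identifier_tokens(tokenizer, partial_ident: str,
                              in_scope_symbols: list[str]) -> set[int]:
    candidates = [s for s in in_scope_symbols if s.startswith(partial_ident)]
    allowed = set()
    for token_id in range(tokenizer.vocab_size):
        piece = tokenizer.decode([token_id])
        extended = partial_ident + piece
        # Legal if we are still typing a known symbol, or have just completed one.
        if any(c.startswith(extended) or extended.startswith(c) for c in candidates):
            allowed.add(token_id)
    return allowed
```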
Entire classes of hallucination problems can be eliminated this way. The marketing writes itself: "Our AI is literally incapable of making the errors humans make!"
What many AI developers, firms, and especially their leaders find grating about this is the implication: that AI is fallible and has to be constrained.
Another such inconvenience is that while these techniques fix the grammar, they highlight the semantic problems: the code is correct & compiles, it just does the wrong thing.
One pattern that I've seen develop (in PydanticAI and elsewhere) is to constrain the output but include an escape hatch. If something goes wrong, the model can bail out and report the problem rather than be forced to proceed down a doomed path.
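A rough sketch of that pattern using plain Pydantic models (the exact PydanticAI API differs, and these type names are made up): the constrained output type is a union of the real answer and an explicit failure report, so the model is never forced to fabricate a plausible-looking result.

```python
# Escape-hatch pattern sketch: union of the structured result and a failure report.
from typing import Literal, Union
from pydantic import BaseModel

class Refactor(BaseModel):
    kind: Literal["refactor"] = "refactor"
    file: str
    new_source: str

class CannotComply(BaseModel):
    kind: Literal["error"] = "error"
    reason: str  # e.g. "function `foo` not found in the provided file"

Output = Union[Refactor, CannotComply]
# A grammar derived from `Output` still constrains generation, but the model
# can always take the `CannotComply` branch instead of inventing a `Refactor`.
```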
You don't need a new model. The trick is that you only change how tokens are sampled: zero out the probability of every token that would be illegal under the grammar or other constraints. All you need for that is an inference API that exposes the full logit vector for the next token, which is trivial for any model you run on your own hardware.
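To make that concrete, a minimal sketch of the sampling step in PyTorch, assuming you maintain the set of currently legal token ids elsewhere (from a grammar automaton or similar):

```python
import torch

def constrained_sample(logits: torch.Tensor, allowed_ids: set[int]) -> int:
    """Pick the next token, but only from ids the grammar currently allows.

    `logits` is the model's raw output for one position, shape [vocab_size].
    """
    mask = torch.full_like(logits, float("-inf"))
    idx = torch.tensor(sorted(allowed_ids), dtype=torch.long)
    mask[idx] = 0.0                                   # keep legal tokens, -inf the rest
    probs = torch.softmax(logits + mask, dim=-1)      # illegal tokens end up with p = 0
    return int(torch.multinomial(probs, num_samples=1))
```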
Using grammar-constrained output in llama.cpp - which has been available for ages, and I think is a different implementation from the one described here - does slow down generation quite a bit. I expect the implementation is naive.
As to why providers don't give you a nice API, maybe it's hard to implement efficiently.
It's not too bad if inference is happening token by token and you drop back to the CPU every time, but I understand high-performance LLM inference uses speculative decoding, where a smaller model drafts multiple tokens in advance and the main model does the verification. Doing grammar constraints across multiple tokens is tougher: there's an exponential number of possible states you'd need to precompute.
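Roughly what I mean (the `automaton` object and its `step` method are hypothetical): you can still check drafted tokens one by one against the parser state, but the check is inherently sequential, and one rejected token invalidates everything drafted after it.

```python
# Sketch: verifying a speculatively drafted run of tokens against a grammar
# automaton. `state.step(token_id)` is assumed to return the next state, or
# None if the token is illegal in the current state.
def accept_draft(start_state, draft_tokens: list[int]):
    accepted = []
    state = start_state
    for tok in draft_tokens:
        next_state = state.step(tok)   # sequential: each step needs the previous state
        if next_state is None:         # grammar violation: drop this token and the rest
            break
        accepted.append(tok)
        state = next_state
    return accepted, state
```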
So you'd need to think about putting the parser automaton onto the GPU/TPU and using it during inference without stalling the pipeline by going back to the CPU.
And then you start thinking about how big that automaton is going to be: how many states, whether it needs a pushdown stack. You're basically taking code from the API call and running it on your hardware. There are dragons here, around fair use, denial of service, etc.
There's also a grammar validation tool in the default llama.cpp build, which makes debugging grammars much easier than having them bounce off the server.
Wouldn't that have implications for inference batching, since you would have to track state and apply a different mask for each sequence in the batch? If so, I think it would directly affect utilisation and hence costs. But I could be talking out of my ass here.
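For what it's worth, roughly what that bookkeeping looks like (a sketch, assuming you already have each sequence's allowed-token set): every row of the batch needs its own mask, so the state tracking scales with batch size even though the masking itself is one cheap vectorised add.

```python
import torch

def apply_per_sequence_masks(logits: torch.Tensor, allowed: list[set[int]]) -> torch.Tensor:
    """logits: [batch, vocab]; allowed[i]: token ids legal for sequence i."""
    mask = torch.full_like(logits, float("-inf"))
    for i, ids in enumerate(allowed):   # each sequence has its own grammar state
        mask[i, torch.tensor(sorted(ids), dtype=torch.long)] = 0.0
    return logits + mask
```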
I mean, most don't? I know you can provide a pseudo-EBNF grammar to llama.cpp, but none of Anthropic, Azure, Bedrock, Mistral, or Gemini, for example, let us do the same.
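For anyone who hasn't tried it, the llama.cpp route looks roughly like this through the llama-cpp-python bindings (a sketch: the model path is a placeholder, the grammar is deliberately tiny, and the API may have shifted since):

```python
from llama_cpp import Llama, LlamaGrammar

# A tiny GBNF grammar: the model may only answer "yes" or "no".
GRAMMAR = r'''
root ::= "yes" | "no"
'''

llm = Llama(model_path="model.gguf")            # placeholder path
grammar = LlamaGrammar.from_string(GRAMMAR)
out = llm("Is the sky blue? Answer yes or no.", grammar=grammar, max_tokens=4)
print(out["choices"][0]["text"])
```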