
When doing structured sampling, why is the token sampled first, checked against the grammar, and then resampled with the mask applied only if it's invalid?

Why wouldn't we apply the mask immediately on the first sampling pass? Is this an optimization of some kind? Is masking expensive?



If you can screen tokens against your grammar fast enough, you can build a bitmask over the entire token vocabulary and apply it right before sampling. As vocabulary sizes grow, doing this in real time gets harder, but we (and other libraries) have found several optimizations that make it extremely fast (e.g., for guidance, we detail some of them here: https://github.com/guidance-ai/llguidance/blob/main/docs/opt...).

Other libraries work by essentially pre-computing all the masks for all possible generations, but of course that restricts you to simple grammars (such as a subset of regular expressions).
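
A minimal sketch of the mask-before-sampling approach (hypothetical code: the allowed-ID list is assumed to come from some grammar engine, and this is not llguidance's actual API):

    import torch

    def sample_constrained(logits: torch.Tensor, allowed_ids: list[int]) -> int:
        # `logits` covers the full vocabulary; `allowed_ids` is whatever
        # the grammar engine says may legally come next.
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed_ids] = 0.0                       # allowed logits pass through
        probs = torch.softmax(logits + mask, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()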


Implementation preference.

> is masking expensive?

It's not expensive per se; it's a single element-wise multiplication of the output vector.

The real "expense" is that you need to prepare masks for every element of your grammar as they are expensive to recompute as needed; LLM tokens do not cleanly map onto elements of your grammar. (Consider JSON: LLM tokens often combine various special characters such as curly braces, colons, and quotes.)

None of this is hard to compute; it's just more work to implement.
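
To make the two costs concrete, here's a toy sketch (entirely hypothetical state names and token IDs; real masks are built against the actual tokenizer, precisely because one token can span several grammar elements):

    import torch

    # Hypothetical precomputed {0, 1} masks over a toy 8-token vocabulary,
    # one mask per grammar state.
    MASKS = {
        "expect_key":   torch.tensor([1, 1, 0, 0, 1, 0, 0, 0], dtype=torch.float),
        "expect_colon": torch.tensor([0, 0, 1, 0, 0, 0, 0, 0], dtype=torch.float),
    }

    def apply_mask(probs: torch.Tensor, state: str) -> torch.Tensor:
        masked = probs * MASKS[state]  # the single cheap element-wise multiply
        return masked / masked.sum()   # renormalize over the allowed tokens

Applying a mask is the cheap multiply in apply_mask; the expensive part is building the MASKS table in the first place.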


Good question. Some frameworks do apply the mask immediately; others defer it for performance or implementation simplicity. Mask precomputation can get tricky with large vocabularies, especially if grammar elements span multiple tokens. Immediate masking is usually preferred, but the optimizations matter when you're juggling complicated grammars or fighting throughput bottlenecks.


Hey! I'm the author of the post. We haven't optimized sampling yet, so it runs linearly on the CPU. A lot of SOTA work either does this while the model is running the forward pass or does the masking on the GPU.

The greedy accept is there so that the mask doesn't need to be computed at all when the sampled token is already grammar-valid. Planning to make this more efficient from both ends.
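
In other words, something like this (a sketch of the idea only; grammar.accepts and grammar.allowed_ids are assumed interfaces, not the post's actual API):

    import torch

    def greedy_accept_sample(logits: torch.Tensor, grammar) -> int:
        probs = torch.softmax(logits, dim=-1)
        tok = torch.multinomial(probs, num_samples=1).item()
        if grammar.accepts(tok):       # fast path: no mask ever computed
            return tok
        # Slow path: build the mask and resample from the valid tokens only.
        mask = torch.full_like(logits, float("-inf"))
        mask[grammar.allowed_ids()] = 0.0
        probs = torch.softmax(logits + mask, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()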



