
This sounds odd, why would you feed the LLM text as bytes instead of characters?


Because if you start with characters, much of the token vocabulary would be dedicated to rare Chinese characters right off the bat. If you start from UTF-8 bytes, you can dedicate more token space to common sequences of multiple characters (i.e. words people actually use) and achieve much better compression ratios.
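To make the "start from bytes" idea concrete, here's a rough sketch in plain Python (nothing model-specific, just the raw UTF-8 view a byte-level tokenizer starts from):

    text_en = "the"
    text_zh = "中"  # a single common CJK character

    # common English words are short runs of single-byte values that BPE can merge cheaply
    print(list(text_en.encode("utf-8")))  # [116, 104, 101]
    # each CJK character already costs 3 bytes before any merging happens
    print(list(text_zh.encode("utf-8")))  # [228, 184, 173]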


I don't understand. Why would much of the vocabulary be dedicated to rare Chinese characters? Wouldn't those need to show up in the training data first? And if they did, wouldn't they also show up as weird byte sequences? And aren't UTF-8 byte sequences kinda risky for everything other than ASCII, since only ASCII bytes and header bytes are unambiguous, whereas continuation bytes (10xxxxxx) are very ambiguous individually? Sure, the LLM would notice that their meaning changes depending on the preceding header and continuation bytes, but it is still not clear to me why UTF-8 bytes are better for LLMs than characters (or even grapheme clusters). UTF-8 bytes seem like a very arbitrary choice to me. Why not do UTF-9 instead and get the most important Latin letters as single nine-bit bytes?


Yes, rare Chinese characters do show up in the training data (the rarest of them at least appear in lists of characters) and yes, they get tokenized as weird byte sequences, making the model work harder to process them, but it's better for that to happen to rare characters than to common words. It's a tradeoff.

And of course UTF-8 is unlikely to be the single best encoding (e.g. Anthropic has a tokenizer that turns all caps text into a special caps lock symbol plus the regular-case equivalent) but much of it is papered over by byte-pair encoding. E.g. the most important Latin letters appear often enough that they get dedicated tokens anyways.
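If you want to see this on an actual vocabulary, something like OpenAI's open-source tiktoken library makes it easy to poke at (a sketch; the exact splits depend on the learned vocabulary, so treat the expected pattern as an assumption):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    # common English words tend to come out as a single dedicated token,
    # while rare characters fall back to several byte-level tokens
    for s in ["hello", " the", "中", "龘"]:
        print(f"{s!r} -> {len(enc.encode(s))} token(s)")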


Thanks, makes sense.


Andrej Karpathy goes into quite some depth about how tokenization works here: https://www.youtube.com/watch?v=zduSFxRajkE

tl;dr most LLMs use byte-pair encoding to create tokens. You take a set of documents and form new tokens by repeatedly merging the most common adjacent pair of existing tokens. The initial set of tokens is the 256 raw byte values, and the text is typically represented in UTF-8.
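A toy version of that merge loop, just to make the mechanics concrete (a sketch of the training step described above, not any particular model's tokenizer):

    from collections import Counter

    def train_bpe(text: str, num_merges: int):
        ids = list(text.encode("utf-8"))        # initial tokens: raw UTF-8 bytes (0..255)
        merges = {}                             # (left_id, right_id) -> new token id
        for new_id in range(256, 256 + num_merges):
            pairs = Counter(zip(ids, ids[1:]))  # count adjacent token pairs
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
            merges[best] = new_id
            # replace every occurrence of that pair with the new token
            out, i = [], 0
            while i < len(ids):
                if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                    out.append(new_id)
                    i += 2
                else:
                    out.append(ids[i])
                    i += 1
            ids = out
        return ids, merges

    ids, merges = train_bpe("low lower lowest low low", num_merges=10)
    print(merges)  # e.g. (108, 111) -> 256, i.e. 'l' + 'o' merged into a new token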

I expect that although LLMs can understand arbitrarily (but cleanly) offset Unicode code points by (eventually) noticing the final byte of each sequence, they would do markedly worse at actually processing and completing such text, because it will not have been reduced to the normal set of tokens. However, if the text is first output cleanly converted, either in internal thinking tokens or at the beginning of the response, they should do fine.

Understanding tokenization is surprisingly useful, even if that video seems awfully long to devote to such a tedious subject. Even Karpathy doesn't like it!


For reference, this was the thread where someone explained that to me (from 5 months ago): https://news.ycombinator.com/item?id=41849759


Oh, that's interesting! It sounds like it's not literally being fed UTF-8 bytes, but instead more like this: for rarely seen characters, it's two tokens, first a code block token (a "Tag" token in this case), followed by a token like "1st character in this code block", "2nd character in this code block", and so on. And since many rare code blocks are Latin-like (tags, circled letters, mathematical Fraktur variables etc.), the LLM picks up that "some block token" + "1st character in the code block" is kinda like "A"? Is that how it works?


Had to read it again as well, but yeah, that's how I'd understand it too. So the "offset in block" tokens are still not the same tokens as for the "real" ASCII letters, but they are the same tokens across all the "weird ASCII-like Unicode blocks". So the model can aggregate the training data from all those blocks and automatically "generalize" to similar characters in other blocks (by learning to ignore the "block identifier" tokens), even ones that have very little or no training data themselves.
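You can see where that structure comes from by just printing the raw UTF-8 bytes of some lookalike characters; whether the shared prefix actually ends up as a single "block" token depends on what the BPE vocabulary happened to merge, so take that part as an assumption:

    for ch in ["A", "B", "𝐀", "𝐁", "Ⓐ", "Ⓑ"]:
        print(f"{ch!r} U+{ord(ch):05X} -> {[hex(b) for b in ch.encode('utf-8')]}")

    # 'A' U+00041 -> ['0x41']
    # 'B' U+00042 -> ['0x42']
    # '𝐀' U+1D400 -> ['0xf0', '0x9d', '0x90', '0x80']  <- shared prefix, last byte is the offset
    # '𝐁' U+1D401 -> ['0xf0', '0x9d', '0x90', '0x81']
    # 'Ⓐ' U+024B6 -> ['0xe2', '0x92', '0xb6']
    # 'Ⓑ' U+024B7 -> ['0xe2', '0x92', '0xb7']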

Edit: So this means that if you want to sanitize text before passing it to an LLM, you don't only have to consider standard Unicode BMP characters but also everything that mirrors those characters in a different block. And because models can do Caesar ciphers with small offsets, possibly even blocks where the characters don't line up exactly but are shifted by a small number.

Maybe it would be better to run the sanitizer on the tokens or even the embedding vectors instead of the "raw" text.
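For the text-level version, one cheap pass is Unicode NFKC normalization from the standard library (a sketch; the `sanitize` name is just made up here, and it won't catch deliberately shifted blocks, which is where a token- or embedding-level check would help):

    import unicodedata

    def sanitize(text: str) -> str:
        # NFKC folds many compatibility lookalikes (mathematical bold, fullwidth,
        # circled letters, ...) back to their plain equivalents
        return unicodedata.normalize("NFKC", text)

    print(sanitize("𝐇𝐞𝐥𝐥𝐨 Ｗｏｒｌｄ"))  # -> Hello World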



