
But also it doesn't see characters. It sees tokens. The only way it could reliably solve this is if it had a lookup table from tokens to characters, which it likely doesn't.

You couldn't do it either if you were given tokens as input, unless you had learned the exact mapping from every token to the characters it contains and their positions. You would have learned the meaning of the token, but not the exact characters it represents.



Even if it sees tokens, I don't think it's an impossible task. Certainly an advanced enough LLM should be able to decipher token meanings, and to know which individual characters a word is made up of regardless of how the full word is tokenized. Maybe it's something GPT-5 can do (or there's some real technical limitation I don't understand).


> to know that a word is made up of the individual character tokens

A token is the smallest unit; it's not made of further tokens. It maps to a number.


I think what OP is getting at is that, given

{the: 1, t: 2, h: 3, e: 4}

there should be somewhere in the corpus something like "the is spelled t h e" that the system can use to pull this out. We can ask GPT to spell out individual words in the NATO phonetic alphabet and see how it does.
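A minimal sketch of what the model actually sees, assuming the `tiktoken` library and the `cl100k_base` encoding (the real vocabulary varies by model): the whole word and its spelled-out form produce entirely different id sequences, and only corpus sentences like "the is spelled t h e" relate them.

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")

  word = "the"
  spelled = "t h e"

  print(enc.encode(word))     # typically a single opaque id for the whole word
  print(enc.encode(spelled))  # several ids, roughly one per letter chunk

  # The model never sees the characters, only these integer sequences.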


> There should be somewhere in the corpus something like "the is spelled t h e" that the system can use to pull this out.

Such an approach would require an enormous table containing all written words, including first and last names, and it would still fail for made-up words.

A more tractable approach would be to give it the map between the individual tokens and their letter components, but then you have the problem that this mapping depends on the specific encoding used by the model (it varies between models). You could give it to the model during fine-tuning, though.
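A rough sketch of that "give it the map" idea, again assuming `tiktoken` and `cl100k_base`: the token-to-characters table can be dumped straight from the encoding, since each token id decodes to a fixed byte string. In practice such a table would have to be baked in during fine-tuning rather than pasted into a prompt.

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")

  token_to_chars = {}
  for token_id in range(enc.n_vocab):
      try:
          raw = enc.decode_single_token_bytes(token_id)  # raw bytes of this BPE token
      except KeyError:
          continue  # skip unused/special ids
      token_to_chars[token_id] = raw.decode("utf-8", errors="replace")

  # Which characters make up the tokens of a made-up word?
  for tid in enc.encode("blorptastic"):
      print(tid, repr(token_to_chars[tid]))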


The best approach would be to have it call a function under the hood for such requests and hide the fact that it called a function.
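A minimal sketch of that idea: route spelling and counting questions to a deterministic helper instead of having the model guess from tokens. The tool-calling plumbing is model-specific and omitted here, and `count_letter`/`spell_out` are hypothetical helper names.

  def count_letter(word: str, letter: str) -> int:
      """Count case-insensitive occurrences of `letter` in `word`."""
      return word.lower().count(letter.lower())

  def spell_out(word: str) -> str:
      """Return the word spelled one character at a time."""
      return " ".join(word)

  # The model would emit something like count_letter("strawberry", "r"),
  # then phrase the returned value in natural language.
  print(count_letter("strawberry", "r"))  # 3
  print(spell_out("strawberry"))          # s t r a w b e r r y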


He's saying the LLM will figure out how many letters are in each token.


They cannot “figure it out”; they could learn it, but for that it would need to be in their training data (which it isn't, because nobody writes down the actual pairings of every byte pair encoding in plain text). Also, the LLM has no clue which encoding it uses unless you tell it somehow in the fine-tuning process or the prompt.


It's about as feasible as telling how many characters of HTML produced this comment by looking at a screenshot. An LLM doesn't see characters, tokens, numbers, or its own activations. The LLM is the "set of rules" component in a Chinese room scenario. Anything the operator of that room does happens at a lower level.

GGP's idea suggests that an LLM, allegedly as the whole room, receives something like "hey, look at these tokens: <tokens>, please infer the continuation". That puts it in the position of a nested room's operator, which (1) it is not, and (2) there is no nested room.


The point, though, is that this is definitely not a task to evaluate an LLM's intelligence with. It's kind of like laughing at Einstein because he couldn't decipher hieroglyphs without prior training in them. Could Einstein potentially learn those hieroglyphs? Sure. But is that the best use of his time, or memory space?



