Hacker News

> On a character level this should be trivial.

Characters are not the semantic components of words—these are syllables. Generally speaking, anyway. I've got to imagine this approach would yield higher quality results than the roman alphabet. I'm curious if this could be tested by just looking at how LLMs handle English vs Chinese.



The minimal semantic parts of words are morphemes. Syllables are phonological units (roughly, the minimal units for rhythmic purposes such as stress).


Only in languages that have morphemes! This is hardly a universal attribute of language so much as an attribute of those that use an alphabet to encode sounds. It makes more sense to just bypass the encoding and directly consider the speech.

Besides, considering morphemes as semantic often results in a completely different meaning than we actually intend. We aren't trying to train a chatbot to speak in prefixes and suffixes; we're trying to train a chatbot to speak in natural language, even if it is encoded to Latin script before output.


That's technically wrong. Every language has morphemes for the simple reason that every word is at least one morpheme. `cat` is a morpheme. `cats` is two morphemes (cat-s).

(The point about semantics is also technically wrong. You would first need to specify your view of semantic compositionality before such a point can be evaluated, but the usual views of semantics don't have any such consequence.)


> Every language has morphemes for the simple reason that every word is at least one morpheme.

Sure, if you define "morpheme" as a collection of syllables that's meaningful to people using alphabetic script. I don't see any benefit to this compared to working with syllables directly, which is a meaningful concept regardless of the script used to encode them.


> Sure, if you define “morpheme” as a collection of syllables

"Cats", as noted, has two morphemes despite having only one syllable. Syllables and morphemes are largely orthogonal: a morpheme can be shorter than, equal to, or longer than a syllable (and even when longer, it may or may not start or end on a syllable boundary).

(Also, syllables aren’t the minimal units even of spoken language; those are phonemes, the smallest sound units that distinguish meaning. A syllable consists of at least one phoneme, potentially more. But morphemes, even in an alphabetic script, if it isn’t perfectly phonetic, still don’t necessarily map to one or more phonemes, since a textual semantic unit may have no effect on pronunciation.)
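To make the "orthogonal" point concrete, here is a minimal sketch in Python. The segmentations are hand-annotated for illustration (an assumption, not the output of any real morphological analyzer or syllabifier), and `EXAMPLES` and `compare` are hypothetical names. It only counts units per word to show that the two segmentations don't line up.

```python
# Hand-annotated toy data: word -> (morphemes, syllables).
# These segmentations are illustrative assumptions, not analyzer output.
EXAMPLES = {
    "cat": (["cat"], ["cat"]),                       # 1 morpheme, 1 syllable
    "cats": (["cat", "s"], ["cats"]),                # 2 morphemes, 1 syllable
    "water": (["water"], ["wa", "ter"]),             # 1 morpheme, 2 syllables
    "unhappy": (["un", "happy"], ["un", "hap", "py"]),  # 2 morphemes, 3 syllables
}


def compare(word):
    """Return (morpheme count, syllable count, relation) for a known word."""
    morphemes, syllables = EXAMPLES[word]
    m, s = len(morphemes), len(syllables)
    if m > s:
        relation = "more morphemes than syllables"
    elif m < s:
        relation = "fewer morphemes than syllables"
    else:
        relation = "same count"
    return m, s, relation


if __name__ == "__main__":
    for word in EXAMPLES:
        m, s, relation = compare(word)
        print(f"{word}: {m} morpheme(s), {s} syllable(s) -> {relation}")
```

Even in this tiny sample, all three relations show up, which is the sense in which neither unit can be defined as "a collection of" the other.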


You might not see any benefit, but that's what those words mean :) Grab any textbook, it is linguistics 101!



