
And even then, in some languages at least, what constitutes a grapheme isn't always well defined.


> in some languages at least, what constitutes a grapheme isn't always well defined.

Can you provide some examples? People say this a lot, but the cases I've been able to find tend to be things like U+01F1 LATIN CAPITAL LETTER DZ, which is only not well defined in the sense that Unicode defines it wrong (as one character rather than two) presumably-on-purpose, for compatibility with one or more older character encodings.
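For anyone who wants to see the "one character rather than two" part concretely, here is a minimal Python sketch using only the standard library. The variable names are just for illustration; it compares the precomposed code point with the plain two-letter string.

```python
import unicodedata

digraph = "\u01f1"  # the single code point U+01F1
pair = "DZ"         # two ordinary code points, U+0044 followed by U+005A

# One code point with its own name, versus two separate letters.
print(len(digraph), unicodedata.name(digraph))
print(len(pair), [unicodedata.name(c) for c in pair])

# The UTF-8 bytes of the digraph are not the bytes of "D" plus the bytes of "Z".
print(digraph.encode("utf-8"), pair.encode("utf-8"))
```

The two strings look alike in most fonts but never compare equal unless something normalizes them first.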


I don’t know about DZ, but things that are two letters in one language can be one in another (famously ij in Dutch, but there are others).

Also, things like æ and œ are much easier to deal with if they are single glyphs. Their upper-case versions are respectively Æ and Œ even in languages where they are two letters. I suppose that now we would do it with something like ZWJs to make sure that both letters are transformed consistently, but there are technical reasons behind the current situation.

[edit] here you go: dz is the 7th letter of the Hungarian alphabet, but the capital version (in a normal sentence) of dz is Dz and not DZ. Yeah, languages are weird: https://en.m.wikipedia.org/wiki/Hungarian_alphabet
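A quick sketch of how that Dz/DZ distinction shows up in Unicode's case mappings, using Python's built-in string methods (this assumes a reasonably recent Python 3; nothing here is specific to Hungarian, it is just what the character database says about the precomposed dz code points):

```python
import unicodedata

small = "\u01f3"       # precomposed lowercase dz

upper = small.upper()  # full uppercase form
title = small.title()  # titlecase form, as used at the start of a word

# Print the code point and official name of each case variant.
for label, ch in [("lower", small), ("title", title), ("upper", upper)]:
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

So the precomposed digraph carries three case forms (lower, title, upper), which two separate letters would get for free from ordinary per-letter casing.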


> famously ij in Dutch

ij is considered two letters in Dutch, although they go by special rules: https://onzetaal.nl/taalloket/ij-plaats-in-alfabet


Is DZ "wrong" because it's not considered a digraph by professionals, or because people don't agree that digraphs should be considered single characters?


"DZ" isn't 'wrong', it's a perfectly valid two-character string consisting of "D" followed by "Z". Assigning to a multi-character string an encoded representation that isn't the concatenation of the representations of each character in the string (especially while insisting that this makes it a distinct character in its own right) is what's wrong.


But the letter DZ is not the same thing as the letter D followed by the letter Z. It's a standalone letter in e.g. Hungarian or Slovak, much like æ isn't "ae".
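For what it's worth, Unicode itself treats these cases differently under compatibility normalization. A small sketch with Python's `unicodedata` (the sample list is my own choice of characters from this thread):

```python
import unicodedata

samples = ["\u01f1", "\u0133", "\u00e6"]  # DZ digraph, ij ligature, ae

# NFKC applies compatibility decompositions, then recomposes.
folded = {ch: unicodedata.normalize("NFKC", ch) for ch in samples}

for ch, out in folded.items():
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", [f"U+{ord(c):04X}" for c in out])
```

The DZ and ij code points fold to two plain letters, while æ has no decomposition and stays a single letter, so the standard does not put æ in the same bucket as the digraph code points.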


True - I was thinking of Unicode's definition ("[extended] grapheme clusters").



