I read this as colloquial English for "around" or "approximately". Not setting a...

rcoveson · on June 13, 2023

But you also read it as referring to ZWJ sequences? So the author has picked a number that is actually below average and they've worded it as up to...?

Saying a ZWJ sequence can be "up to 5 bytes" is like saying "the current generation of Intel processors run at clock speeds of up to 2 GHz".

If they were referring to ZWJ sequences (I don't think they were; I think they were just misremembering the maximum encoded length of a codepoint) and they had said "up to 35 bytes", then I might agree with you. It's still not technically accurate, but it's a reasonable colloquial usage, like saying "human males can grow up to seven feet tall".
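
For anyone who wants to check the byte counts rather than argue about them, here is a minimal Python sketch. The family emoji and the helper name `utf8_len` are my own picks for illustration, not something from the article; other ZWJ sequences will be longer or shorter.

```python
# One common ZWJ sequence: man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
# (Illustrative choice; the article does not name a specific sequence.)
ZWJ = "\u200d"  # ZERO WIDTH JOINER
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])


def utf8_len(s: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


single = utf8_len("\U0001F468")  # one emoji code point on its own
print(len(family))       # code points in the sequence: 7
print(single)            # bytes for the single code point: 4
print(utf8_len(family))  # bytes for the whole sequence: 25
```

Each of the four emoji takes 4 bytes and each of the three joiners takes 3, so even this fairly ordinary sequence is well past 5 bytes.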

WorldMaker · on June 13, 2023

I think you are trying to read something that wasn't meant to be technical documentation as if it was trying to be exact technical specifications. I'm not the original author, so I don't have reason to litigate this any further, and I'm not sure what you are arguing about at this point.

rcoveson · on June 13, 2023

You've now replied to me up to 2 times.

z3t4 · on June 14, 2023

Sorry, I meant codepoint/characters, but it would not surprise me if there existed an encoding or language where my wording would be technically correct, though I do not know of any such encoding. I also did not know that there exist more than 5 combinations in Unicode, but I'm not surprised, and my implementation is probably buggy. But I do challenge you to test how well your favourite editor (terminal emulator, cough) handles Unicode emojis.

WorldMaker · on June 14, 2023

UTF-8's original specification included 5-byte and 6-byte encodings to cover the complete astral plane (31-bit code points), but later specifications have marked those "invalid" due to the current 21-bit limit of UTF-16, aligning the two specifications for now rather than fixing the bugs in UTF-16 (or scratching UTF-16 altogether). In theory, UTF-8 can even extend beyond 6-byte encodings (and UTF-32 into 8-byte encodings and beyond) if the next plane (63-bit code points) or the one after that ever needed to open up. (No one expects that any time soon, of course. Today Unicode is nowhere close to filling 21 bits, much less 31. That would be a massive shock, and the compatibility headache would be terrible, with UTF-16 breaking or today's software breaking wherever it hard-codes the assumption that UTF-8 should never go past 4-byte encodings.)
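
A quick way to see the "invalid today" part in practice, sketched in Python. The byte string below is hand-built to follow the old 5-byte lead-byte pattern (`111110xx` plus four continuation bytes), and `decodes_as_utf8` is just a hypothetical helper name for this example.

```python
# Hand-built bytes in the old 5-byte layout: 111110xx followed by four
# 10xxxxxx continuation bytes. Under the original scheme this would have
# encoded 0x200000, the smallest value that needed five bytes.
five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

# The largest code point Unicode allows today, U+10FFFF, fits in four bytes.
largest = "\U0010FFFF".encode("utf-8")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(largest.hex())               # f48fbfbf
print(decodes_as_utf8(largest))    # True
print(decodes_as_utf8(five_byte))  # False
```

Python's own string type has the same ceiling: `chr(0x110000)` raises `ValueError`. Other decoders may report the failure differently, but a conforming one should not accept the 5-byte form.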

rcoveson · on June 14, 2023

If it wouldn't surprise you then I think you should recalibrate your feelings about how surprising Unicode encodings are. There aren't very many of them, they haven't changed in a very long time, and they don't deal with any of the stuff that makes Unicode very complicated (collation, combination characters, etc). They just encode 21-bit integers, albeit sometimes in a highly convoluted way for backwards-compatibility reasons (UTF-16). It's not the kind of thing that needs to be estimated, or where a layer of FUD is warranted (as it kind of is with combination characters). When talking of codepoints, it's just "up to 4 bytes", high confidence, nothing more to it.
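
To make the "up to 4 bytes" claim concrete, here is a small Python sketch that measures a code point at each UTF-8 length boundary in all three encodings. The `boundaries` list and the `encoded_lengths` helper are names made up for this example; the `-le` codecs are used so that no byte order mark inflates the counts.

```python
# Code points on either side of each UTF-8 length boundary, plus the
# last valid code point. (Surrogates U+D800..U+DFFF are skipped because
# they are not encodable on their own.)
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]


def encoded_lengths(cp: int) -> tuple[int, int, int]:
    """Byte lengths of one code point in UTF-8, UTF-16 and UTF-32."""
    ch = chr(cp)
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


for cp in boundaries:
    print(f"U+{cp:06X}", encoded_lengths(cp))

# Longest encoding of any listed code point, per encoding.
max_utf8 = max(encoded_lengths(cp)[0] for cp in boundaries)
max_utf16 = max(encoded_lengths(cp)[1] for cp in boundaries)
print(max_utf8, max_utf16)  # 4 4
```

UTF-16 jumps from 2 to 4 bytes at U+10000, which is the surrogate-pair mechanism behind the "highly convoluted" remark. UTF-32 is 4 bytes throughout.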