Thanks for your response. Personally I fall into the "strings are arrays of bytes" camp (which is also shared by Go). A difference between my view and that of the Go designers is that I don't feel that it is important to support Unicode by default and am perfectly happy to assume that every character corresponds to a single byte. Obviously that makes internationalization harder, but the advantage is that strings are much simpler to reason about. For example, the number of characters in the string is simply the length of the string. I would be fine having a separate Unicode string type in the standard library for those instances when you really need Unicode; this design makes the common case much simpler at the expense of making the rare case harder. I also don't see that mutability is such a huge deal unless you absolutely insist that your language support string interning.
>I would be fine having a separate Unicode string type in the standard library for those instances when you really need Unicode; this design makes the common case much simpler at the expense of making the rare case harder.
Even as a native English speaker, I'm extremely uncomfortable with the idea that we're going to make software even more difficult to internationalize than it already is by using completely separate types for ASCII/Latin1-only text and Unicode.
And it's a whole different level of Anglocentric to portray non-English languages as the "rare" case.
So much this. Thinking that only America and the UK matter was forgivable 40 years ago, but not today. It’s even more bizarre because of what you point out: emoji don’t make sense if you treat them as arrays of single-byte characters. And lastly, even if you only consider input boxes that don’t accept emoji, like names or addresses, you have to remember that America is a nation of immigrants. A lot of folks have names that aren’t going to fit in ASCII.
And this stuff actually matters! In a legal, this-will-cost-us-money kind of way! In 2019 a bank in the EU was penalised because they wouldn’t update a customer’s name to include diacritics (like á, è, ô, ü, ç). Their systems couldn’t support the diacritics because they were built in the 90s on an encoding invented in the 60s. Not their fault, but they were still penalised. (https://shkspr.mobi/blog/2021/10/ebcdic-is-incompatible-with...)
It is far more important that strings be UTF-8 encoded than that they be indexable like arrays. Rust gets this right and I hope future languages will too.
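To make that concrete, here's a quick Rust sketch (my own toy example, not from any particular codebase) of what "UTF-8 but not indexable like an array" looks like in practice:

    fn main() {
        let s = "héllo"; // 5 code points, 6 bytes in UTF-8
        // let c = s[0]; // does not compile: `str` cannot be indexed by an integer
        println!("{} bytes, {} code points", s.len(), s.chars().count()); // 6, 5
        // Byte-range slicing exists, but it panics if the range splits a code point:
        // let bad = &s[0..2]; // would panic: byte 2 is inside 'é', which spans bytes 1..3
        let ok = &s[0..3]; // "hé"
        println!("{ok}");
    }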
On paper you're not wrong, but Strings used for localized text are a special subclass you can deal with separately. Most Strings that will cause you problems are, you know, technical: logs, names of subsystems, client IDs, client-sourced API-provided values whose format changes from client to client, etc. Those, in my experience, are always ASCII even in China, exactly because nobody wants to deal with too much crap.
Display Strings are simpler to manipulate in most cases: you load a String from a file or form and store it back verbatim in a DB or in memory. You barely do anything other than straight copying, right?
The way I do it in Java is that I always assume and enforce that my strings are single-byte ASCII, and if I want to display something localized, it somehow never goes through any complex logic where I need to know the encoding: I copy the content along with encoding metadata, and the other side just displays it.
"strings are arrays of bytes" combined with the assumption that "characters are a single byte" sounds basically the same as the "array of code points" that the parent comment is disagreeing with
Sure, but if you insist that the string be represented as one byte per character, an "array of bytes" and an "array of code points" end up with exactly the same properties.
No, it's impossible to do random access to retrieve a character if you are dealing with code points, because code points do not have a fixed byte size. I thought this was a good intro: <https://tonsky.me/blog/unicode/>.
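A small Rust illustration of why (my own example): code points are 1 to 4 bytes wide in UTF-8, so there is no fixed offset you can jump to for "the nth character":

    fn main() {
        // Each code point below has a different byte width in UTF-8.
        for c in "aé€🦀".chars() {
            println!("{c}: {} byte(s)", c.len_utf8()); // 1, 2, 3, 4
        }
        // You can still get the nth code point, but it's a linear scan, not O(1):
        let third = "aé€🦀".chars().nth(2).unwrap(); // '€'
        println!("third code point: {third}");
    }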
> For example, the number of characters in the string is simply the length of the string.
For I/O you need the number of bytes the string occupies in memory, and that's always known.
For text processing, you don't actually need to know the length of the text. What you actually need is the ability to determine the byte boundaries between code points and, most importantly, between grapheme clusters.
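Something along these lines is usually enough (a Rust sketch; the grapheme half leans on the unicode-segmentation crate, an external dependency, so treat that choice as one option among several):

    use unicode_segmentation::UnicodeSegmentation; // external crate: unicode-segmentation

    fn main() {
        let s = "naïve 🇫🇷";
        // Byte offset of each code point:
        for (byte_idx, c) in s.char_indices() {
            println!("{byte_idx}: {c:?}");
        }
        // Grapheme clusters, i.e. the closest thing to what a user calls a "character":
        for g in s.graphemes(true) {
            println!("grapheme: {g:?}"); // the flag comes out as one grapheme, not two code points
        }
    }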
> when you really need Unicode
You always need Unicode. Sorry, but it's almost 2024 and I shouldn't even have to justify this.
For I/O, you don't need "strings" at all; you need byte buffers. For text, you need Unicode, and everything else is just fundamentally wrong.
> Obviously that makes internationalization harder, but the advantage is that strings are much simpler to reason about.
Internationalization relative to what? Anyway, just pick any language in the world, i.e. an arbitrary one: can you represent it using just ASCII? If so, I would like to know what language that is. It seems that Rotokas can be.[1] That’s about 5,000 speakers, so you can make computer programs for them.
Out of interest, would you also say that "images are arrays of bytes"?
If not, what's the semantic difference?
For me, strings represent text, which is fundamentally linked to language (and all of its weird vagueness and edge cases). I feel like there's a "Fallacies programmers believe about text" list that should exist somewhere, containing items like "text has a defined length" and "two identical pieces of text mean the same thing".
So whilst it's nice to have an implementation that lets you easily "seek to the 5th character", it's not always the case that this is a well-defined thing.
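A concrete case (sketched in Rust, but the same holds anywhere): the same visible word can be spelled with a precomposed 'é' or with 'e' plus a combining accent, so even "how many characters is this?" has two answers:

    fn main() {
        let nfc = "café";        // 'é' as the single code point U+00E9
        let nfd = "cafe\u{301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        assert_eq!(nfc.chars().count(), 4); // c a f é
        assert_eq!(nfd.chars().count(), 5); // c a f e + combining accent
        assert_ne!(nfc, nfd); // byte for byte, they are different strings
        println!("both render as: {nfc} / {nfd}");
    }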
I love it when the writing gets visibly more unhinged and frustrated with each invalidated assumption. It's like the person's mind is desperately trying to find some small sliver of truth to hold onto, but it can't, because the rug keeps getting pulled out from under it.