Thanks for your response. Personally I fall into the "strings are arrays of bytes" camp (which is also shared by Go). A difference between my view and that of the Go designers is that I don't feel that it is important to support Unicode by default and am perfectly happy to assume that every character corresponds to a single byte. Obviously that makes internationalization harder, but the advantage is that strings are much simpler to reason about. For example, the number of characters in the string is simply the length of the string. I would be fine having a separate Unicode string type in the standard library for those instances when you really need Unicode; this design makes the common case much simpler at the expense of making the rare case harder. I also don't see that mutability is such a huge deal unless you absolutely insist that your language support string interning.
>I would be fine having a separate Unicode string type in the standard library for those instances when you really need Unicode; this design makes the common case much simpler at the expense of making the rare case harder.
Even as a native English speaker, I'm extremely uncomfortable with the idea that we're going to make software even more difficult to internationalize than it already is by using completely separate types for ASCII/Latin1-only text and Unicode.
And it's a whole different level of Anglocentric to portray non-English languages as the "rare" case.
So much this. Thinking that only America and the UK matter was forgivable 40 years ago, but not today. It’s even more bizarre because of what you point out: emoji don’t make sense if you treat them as arrays of single-byte characters. And lastly, even if you only consider input boxes that don’t accept emoji, like names or addresses, you have to remember that America is a nation of immigrants. A lot of folks have names that aren’t going to fit in ASCII.
And this stuff actually matters! In a legal, this-will-cost-us-money kind of way! In 2019 a bank in the EU was penalised because they wouldn’t update a customer’s name to include diacritics (like á, è, ô, ü, ç). Their systems couldn’t support the diacritics because they were built in the 90s on an encoding invented in the 60s. Not their fault, but they were still penalised. (https://shkspr.mobi/blog/2021/10/ebcdic-is-incompatible-with...)
It is far more important that strings be UTF-8 encoded than that they be indexable like arrays. Rust gets this right and I hope future languages will too.
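To make that concrete, here's a quick Rust sketch (my own toy example, not from any particular codebase) of what "UTF-8 but not indexable like an array" looks like in practice:

    fn main() {
        let s = "héllo"; // 5 code points, 6 bytes in UTF-8
        // let c = s[0]; // does not compile: `str` cannot be indexed by an integer
        println!("{} bytes, {} code points", s.len(), s.chars().count()); // 6, 5
        // Byte-range slicing exists, but it panics if the range splits a code point:
        // let bad = &s[0..2]; // would panic: byte 2 is inside 'é', which spans bytes 1..3
        let ok = &s[0..3]; // "hé"
        println!("{ok}");
    }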
On paper you're not wrong, but Strings used for localized text are a special subclass you can deal with separately. Most Strings that will cause you problems are, you know, technical: logs, names of subsystems, client IDs, client-sourced API-provided values whose format changes from client to client, etc. Those, in my experience, are always ASCII even in China, exactly because nobody wants to deal with too much crap.
Display Strings are simpler to manipulate in most cases: you load a String from a file or form and store it back verbatim in a DB or in memory. You barely do anything other than straight copying, right?
The way I do it in Java is that I always assume and enforce that my strings are single-byte ASCII, and if I want to display something localized, it somehow never goes through any complex logic where I need to know the encoding: I copy the content along with encoding metadata, and the other side just displays it.
"strings are arrays of bytes" combined with the assumption that "characters are a single byte" sounds basically the same as the "array of code points" that the parent comment is disagreeing with
Sure, but if you insist that the string be represented as one byte per character, an "array of bytes" and an "array of code points" end up with exactly the same properties.
No, it's impossible to do random access to retrieve a character if you are dealing with code points, because code points do not have a fixed byte size. I thought this was a good intro: <https://tonsky.me/blog/unicode/>.
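A small Rust illustration of why (my own example): code points are 1 to 4 bytes wide in UTF-8, so there is no fixed offset you can jump to for "the nth character":

    fn main() {
        // Each code point below has a different byte width in UTF-8.
        for c in "aé€🦀".chars() {
            println!("{c}: {} byte(s)", c.len_utf8()); // 1, 2, 3, 4
        }
        // You can still get the nth code point, but it's a linear scan, not O(1):
        let third = "aé€🦀".chars().nth(2).unwrap(); // '€'
        println!("third code point: {third}");
    }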
> For example, the number of characters in the string is simply the length of the string.
For I/O you need the number of bytes the string occupies in memory, and that's always known.
For text processing, you don't actually need to know the length of the text. What you actually need is the ability to determine the byte boundaries between code points and, most importantly, between grapheme clusters.
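Something along these lines is usually enough (a Rust sketch; the grapheme half leans on the unicode-segmentation crate, an external dependency, so treat that choice as one option among several):

    use unicode_segmentation::UnicodeSegmentation; // external crate: unicode-segmentation

    fn main() {
        let s = "naïve 🇫🇷";
        // Byte offset of each code point:
        for (byte_idx, c) in s.char_indices() {
            println!("{byte_idx}: {c:?}");
        }
        // Grapheme clusters, i.e. the closest thing to what a user calls a "character":
        for g in s.graphemes(true) {
            println!("grapheme: {g:?}"); // the flag comes out as one grapheme, not two code points
        }
    }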
> when you really need Unicode
You always need Unicode. Sorry, but it's almost 2024 and I shouldn't even have to justify this.
For I/O, you don't need "strings" at all; you need byte buffers. For text, you need Unicode, and everything else is just fundamentally wrong.
> Obviously that makes internationalization harder, but the advantage is that strings are much simpler to reason about.
Internationalization relative to what? Anyway, just pick any language in the world, i.e. an arbitrary one: can you represent it using just ASCII? If so, I would like to know what language that is. It seems that Rotokas can be.[1] That’s about 5,000 speakers, so you can make computer programs for them.
Out of interest, would you also say that "images are arrays of bytes"?
If not, what's the semantic difference?
For me, strings represent text, which is fundamentally linked to language (and all of its weird vagueness and edge cases). I feel like there's a "Fallacies programmers believe about text" list that should exist somewhere, containing items like "text has a defined length" and "two identical pieces of text mean the same thing".
So whilst it's nice to have an implementation that lets you easily "seek to the 5th character", it's not always the case that this is a well-defined thing.
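A concrete case (sketched in Rust, but the same holds anywhere): the same visible word can be spelled with a precomposed 'é' or with 'e' plus a combining accent, so even "how many characters is this?" has two answers:

    fn main() {
        let nfc = "café";        // 'é' as the single code point U+00E9
        let nfd = "cafe\u{301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        assert_eq!(nfc.chars().count(), 4); // c a f é
        assert_eq!(nfd.chars().count(), 5); // c a f e + combining accent
        assert_ne!(nfc, nfd); // byte for byte, they are different strings
        println!("both render as: {nfc} / {nfd}");
    }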
I love it when the writing gets visibly more unhinged and frustrated with each invalidated assumption. It's like the person's mind is desperately trying to find some small sliver of truth to hold onto, but it can't, because the rug keeps getting pulled out from under it.