> I would be fine having a separate Unicode string type in the standard library for those instances when you really need Unicode; this design makes the common case much simpler at the expense of making the rare case harder.
Even as a native English speaker, I'm extremely uncomfortable with the idea that we're going to make software even more difficult to internationalize than it already is by using completely separate types for ASCII/Latin1-only text and Unicode.
And it's a whole different level of Anglocentric to portray non-English languages as the "rare" case.
So much this. Thinking that only America and the UK matter was forgivable 40 years ago, but not today. It's even more bizarre because of what you point out: emojis don't make sense if you treat them as single-byte arrays (see the short sketch below). And lastly, even if you only consider input boxes that don't accept emojis, like names or addresses, you have to remember that America is a nation of immigrants. A lot of folks have names that aren't going to fit in ASCII.
And this stuff actually matters! In a legal, this-will-cost-us-money kind of way! In 2019 a bank in the EU was penalised because they wouldn't update a customer's name to include diacritics (like á, è, ô, ü, ç). Their systems couldn't support the diacritics because they were built in the 90s around an encoding invented in the 60s. Not their fault, but they were still penalised. (https://shkspr.mobi/blog/2021/10/ebcdic-is-incompatible-with...)
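To make the byte-count point concrete, here's a minimal Rust sketch (Rust only because it comes up in the next comment; the name and emoji are made-up examples). Both accented letters and emoji take more than one byte in UTF-8, so any "one byte per character" assumption breaks:

    fn main() {
        // Illustrative values: an accented name and a single emoji.
        let name = "José";
        let wave = "👋";

        // chars() counts Unicode scalar values; len() counts UTF-8 bytes.
        println!("{name}: {} chars, {} bytes", name.chars().count(), name.len());
        // -> José: 4 chars, 5 bytes
        println!("{wave}: {} chars, {} bytes", wave.chars().count(), wave.len());
        // -> 👋: 1 chars, 4 bytes
    }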
It is far more important that strings be UTF-8 encoded than that they be indexable like arrays. Rust gets this right, and I hope future languages will too.
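For the curious, here's a small sketch of what "not indexable like arrays" means in Rust (the string is an arbitrary example): `str` refuses plain integer indexing and makes you say whether you want bytes, chars, or a boundary-checked slice.

    fn main() {
        let s = "naïve 🙂"; // arbitrary example string

        // let c = s[1]; // does not compile: `str` cannot be indexed by
        // an integer, because a byte offset may land in the middle of a
        // multi-byte character.

        // Instead you say what you mean: iterate scalar values together
        // with their starting byte offsets...
        for (offset, ch) in s.char_indices() {
            println!("byte {offset}: {ch}");
        }

        // ...or slice by byte range, which panics at runtime if the
        // range does not fall on character boundaries.
        assert_eq!(&s[0..2], "na"); // fine
        // let _ = &s[0..3];        // would panic: splits the 'ï'
    }

The design choice is that there is no cheap operation pretending to be "the nth character"; anything that looks O(1) on a `str` really is O(1) on bytes, and anything character-aware is explicit.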