
Thanks for your response. Personally I fall into the "strings are arrays of bytes" camp (which is also shared by Go). A difference between my view and that of the Go designers is that I don't feel that it is important to support Unicode by default and am perfectly happy to assume that every character corresponds to a single byte. Obviously that makes internationalization harder, but the advantage is that strings are much simpler to reason about. For example, the number of characters in the string is simply the length of the string. I would be fine having a separate Unicode string type in the standard library for those instances when you really need Unicode; this design makes the common case much simpler at the expense of making the rare case harder. I also don't see that mutability is such a huge deal unless you absolutely insist that your language support string interning.
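A quick Python sketch (an illustration, not from the comment) of the trade-off being described: under an ASCII-only assumption, byte length and character count coincide, but they diverge as soon as non-ASCII text appears.

```python
# Under the ASCII assumption, length in bytes == number of characters.
s = "hello"
assert len(s.encode("ascii")) == 5

# With UTF-8, the two counts diverge: 'é' is one code point but two bytes.
t = "héllo"
assert len(t) == 5                  # five code points
assert len(t.encode("utf-8")) == 6  # six bytes
```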


>I would be fine having a separate Unicode string type in the standard library for those instances when you really need Unicode; this design makes the common case much simpler at the expense of making the rare case harder.

Even as a native English speaker, I'm extremely uncomfortable with the idea that we're going to make software even more difficult to internationalize than it already is by using completely separate types for ASCII/Latin1-only text and Unicode.

And it's a whole different level of Anglocentric to portray non-English languages as the "rare" case.


If you give an input box to an American I promise you an emoji will find its way into it no matter what it's for.


So much this. Thinking that only America and the UK matter is something that was forgivable 40 years ago but not today. It’s even more bizarre because of what you point out - emojis don’t make sense if you treat strings as arrays of single-byte characters. And lastly, even if you only consider input boxes that don’t accept emojis, like names or addresses, you have to remember that America is a nation of immigrants. A lot of folks have names that aren’t going to fit in ASCII.

And this stuff actually matters! In a legal, this-will-cost-us-money kind of way! In 2019 a bank in the EU was penalised because they wouldn’t update a customer’s name to include diacritics (like á, è, ô, ü, ç). Their systems couldn’t support the diacritics because it was built in the 90s with an encoding invented in the 60s. Not their fault but they were still penalised. (https://shkspr.mobi/blog/2021/10/ebcdic-is-incompatible-with...)

It is far more important that strings be utf-8 encoded than they be indexable like arrays. Rust gets this right and I hope future languages will too.
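To illustrate in Python what Rust enforces at the language level (slicing a `&str` off a character boundary panics): arbitrary byte indexing into UTF-8 data can split a code point in half.

```python
s = "héllo"
data = s.encode("utf-8")  # b'h\xc3\xa9llo' -- 6 bytes for 5 code points

# Byte index 2 falls in the middle of 'é', so this slice is not valid UTF-8:
try:
    data[:2].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

assert not split_ok  # arbitrary byte indexing can split a code point
```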


"i must have indexable strings for performance reasons. oh, btw its an electron app"


Unicode has such a rich collection of symbols. I use them frequently in code comments.


such a different level, you could even call it Latincentric :)


> this design makes the common case much simpler at the expense of making the rare case harder

In the age of emoji (and uhhhh, everyone who doesn't use English as their main language), I don't think your "rare case" is really that rare.


There must be statistics around how much of the data in the world is or could be Latin-1. I'm going to guess it's a very high percentage.


wat?

Counting webpages (not bytes), roughly 50% are English-only and 50% are in other languages.

Only 20% of people worldwide speak English.

I know some other languages can fit inside ASCII, but really?


> In the age of emoji (and uhhhh, everyone who doesn't use English as their main language)

Open a random NYT article of more than a hundred words. It won’t be in ASCII.


Do note that emoji are a case where the "codepoint" approach fails catastrophically.


Can you please elaborate?


On paper you're not wrong, but Strings used for localized text are a special subclass you can deal with separately. Most Strings that will cause you problems are, you know, technical: logs, names of subsystems, client IDs, client-sourced API values whose format varies across clients, etc. Those, in my experience, are always ASCII even in China, exactly because nobody wants to deal with too much crap.

Display Strings are simpler to manipulate in most cases: load a String from a file or form, store it back verbatim in the DB or memory; you barely do anything other than straight copying, right?

The way I do it in Java is that I always assume and enforce my strings to be single-byte ASCII, and if I want to display something localized, it never really goes through any complex logic where I need to know the encoding: I copy the content along with encoding metadata, and the other side just displays it.


"strings are arrays of bytes" combined with the assumption that "characters are a single byte" sounds basically the same as the "array of code points" that the parent comment is disagreeing with


Code points are not bytes.


Sure, but if you're insisting that the string be represented as one byte per character, you end up with the exact same properties with "array of code points" and "array of bytes"


Sort-of, but no, because code points are not characters.

There's a big difference between "get the 5th code point" and "get the 5th character".

Because multiple code points can be combined into a single character, it's not possible to do random character access in a Unicode-encoded string.


No, it's impossible to do random access to retrieve a character if you are dealing with code points, because code points do not have a fixed byte size. I thought this was a good intro: <https://tonsky.me/blog/unicode/>.
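A concrete Python sketch of the emoji failure mode mentioned upthread: a family emoji is a ZWJ sequence, one user-perceived character built from several code points, so code-point-level operations tear it apart.

```python
family = "👩‍👩‍👧‍👦"  # one user-perceived character (a ZWJ emoji sequence)

assert len(family) == 7                   # 4 person code points + 3 zero-width joiners
assert len(family.encode("utf-8")) == 25  # 25 bytes in UTF-8

# Naive code-point reversal destroys the grapheme cluster:
assert family[::-1] != family
```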


> For example, the number of characters in the string is simply the length of the string.

For I/O you need the number of bytes it occupies in memory, and that's always known.

For text processing, you don't actually need to know the length of the text. What you actually need is the ability to determine the byte boundaries between each code point and most importantly each grapheme cluster.
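Finding code-point boundaries in UTF-8 needs no length bookkeeping at all, because continuation bytes are self-identifying (they look like `0b10xxxxxx`). A small Python sketch; note that grapheme-cluster boundaries additionally need a Unicode segmentation library on top of this:

```python
def is_codepoint_boundary(data: bytes, i: int) -> bool:
    # A UTF-8 continuation byte has the form 0b10xxxxxx;
    # a code point starts at any byte that is not one.
    return i == len(data) or (data[i] & 0xC0) != 0x80

data = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
boundaries = [i for i in range(len(data) + 1) if is_codepoint_boundary(data, i)]
assert boundaries == [0, 1, 3, 4, 5, 6]  # index 2 is inside 'é'
```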

> when you really need Unicode

You always need Unicode. Sorry but it's almost 2024 and I shouldn't even have to justify this.

For I/O, you don't need "strings" at all, you need byte buffers. For text, you need Unicode and everything else is just fundamentally wrong.


> Obviously that makes internationalization harder, but the advantage is that strings are much simpler to reason about.

Internationalization relative to what? Anyway, just pick any language in the world, i.e. an arbitrary one—can you represent it using just ASCII? If so I would like to know what language that is. It seems that Rotokas can be.[1] That’s about 5K speakers. So you can make computer programs for them.

Of course this comment of mine isn’t ASCII-only.

[1] https://en.wikipedia.org/wiki/Rotokas_language


Which non-ASCII character did you use?


Out of interest, would you also say that "images are arrays of bytes"?

If not, what's the semantic difference?

For me, strings represent text, which is fundamentally linked to language (and all of its weird vagueness and edge-cases). I feel like there's a "Fallacies programmers believe about text" that should exist somewhere, containing items like "text has a defined length" and "two identical pieces of text mean the same thing".

So whilst it's nice to have an implementation that lets you easily "seek to the 5th character", it's not always the case that this is a well defined thing.


> I feel like there's a "Fallacies programmers believe about text" that should exist somewhere

I got you covered.

https://github.com/kdeldycke/awesome-falsehood#international...

http://garbled.benhamill.com/2017/04/18/falsehoods-programme...

https://jeremyhussell.blogspot.com/2017/11/falsehoods-progra...

https://wiesmann.codiferes.net/wordpress/archives/30296

I love when the writing gets visibly more unhinged and frustrated with each invalidated assumption. It's like the person's mind is desperately trying to find some small sliver of truth to hold onto but it can't because the rug is getting constantly pulled out from under it.


See also: Text Rendering Hates You https://faultlore.com/blah/text-hates-you/ and Text Editing Hates You Too https://lord.io/text-editing-hates-you-too/


Amazing, thanks. Bookmarking the lot.



