
It is time for POSIX to get with the times. Computers are used in more than the US and Canada (for the most generous interpretation of "American" in ASCII I'm including Canada; their French speakers will not be happy with that, not to mention the First Nations, of whose written languages I know nothing but imagine they need more than ASCII). UTF-8 has been standard for decades now. Just state that as of POSIX 2025 all of UTF-8 is allowed in all string contexts unless there is a specific list of exception characters for that context (that is, they never define a list of allowed characters, only excluded ones). They probably need to standardize on UTF-8 normalization functions and when they must be used in string comparisons. Probably also need some requirement that an alternate UTF-8 character entry scheme exist on all keyboards.
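
For example, a minimal Python sketch (using the standard unicodedata module; the strings are illustrative) of why comparisons need a defined normalization step:

  import unicodedata

  a = "caf\u00e9"        # 'café' with precomposed U+00E9
  b = "cafe\u0301"       # 'café' as 'e' + U+0301 combining acute

  print(a == b)          # False: same rendered text, different code points
  print(unicodedata.normalize("NFC", a) ==
        unicodedata.normalize("NFC", b))   # True after canonical normalization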

The above is a lot of work and will probably take more than a year to put into the standard, much less implement, but anything less is just user hostile. Sometimes committees need to lead from the front, not just write down existing practice.



Some practical concerns I have with UTF-8 are similar (or even identical, depending on the font) characters which can be used in malicious ways (think package names, URLs, etc.), not to even mention RTL text and other control characters. Every time I add logging code, I make sure that any "interesting" characters are unambiguously escaped or otherwise signaled out-of-band. Having English as an international writing standard is perfectly fine, and I say that as a non-native speaker with a non-ASCII name.
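
One way to do that kind of out-of-band signaling, sketched in Python (the unicode_escape codec is standard; the helper name is mine):

  def log_safe(s: str) -> str:
      # Backslash-escape everything outside printable ASCII, so RTL
      # overrides, control characters and lookalike letters are visible.
      return s.encode("unicode_escape").decode("ascii")

  print(log_safe("invoice\u202egnp.exe"))  # invoice\u202egnp.exe, RTL override exposed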


A good chunk of the world does not speak English or a language written in the Latin script. They should be able to interact with computers completely in their own languages and writing systems, even ones written right-to-left or top-to-bottom.

Of course, someone has to do the work to make this possible. And no one is obliged to do it. But to suggest that such work should not be done at all does not sit right.


This isn't quite black and white.

Right now, I can set up and use Linux in my language, have my display name in my script, but my username and password are ASCII-only and are available on the standard English keyboard anywhere. If I run into trouble, I can SSH in from any device in the world without any issue. I can just borrow a laptop from anyone, switch to English if needed, and jump right in.

Having a common denominator set of characters for such things is just really, really useful. I’d rather focus on all the other things that need to be localised.


"Without any issue" is a stretch; using a French keyboard is a bad enough experience for passwords, and not everyone uses a standard English keyboard.


The French keyboard is the most notable example of anyone using something other than query or quertz. Even Japan and China use an extended querty. But even with the French keyboard the only issue is that everything is in the wrong place, not that the standard 26 "English" letters don't exist or are hard to reach.

Meanwhile using ä, é or ş in a username or password will make your life much harder once you are in a foreign country. Never mind any letter that isn't derived from the Latin alphabet.


> something other than query or quertz. Even Japan and China use an extended querty

qwerty


> A good chunk of the world does not speak english or latin character based languages.

nearly everyone in a first world country knows the English alphabet though. a vast majority of the developing world as well. just look at street view on Google maps in any country, there's going to be a ton of street signs using English characters, even in non-touristy areas.

> They should be able to interact with computers completely in their own languages and alphabet sets, even if those are written right-to-left or top-to-bottom.

if you're a typical android/ios end user you're interacting with a computer in your native language anyway. this discussion only applies to low level power users.

in that case: why? these aren't user-facing features. this is like saying that people should be able to use symbols native to their language rather than greek letters when writing math papers.

it might not be "fair" that English is overrepresented in computing but it also hasn't demonstrably been a barrier to entry. Japan, Korea and China have dominated, particularly in hardware.

if you think it should be fixed why stop at usernames? why represent uids with 1234 instead of 一二三四?


> nearly everyone in a first world country knows the English alphabet though

And not only the 1st world. Actually, the bigger the country, the more everything is localized, from dubbed films to food packaging labels. In a small country one sees more English/Spanish/French etc. because they don't have the resources to localize everything.


> if you're a typical android/ios end user you're interacting with a computer in your native language anyway. this discussion only applies to low level power users.

I don't think you realize how poor this experience is. Part of the reason is that the underlying system is so English-focused that app developers have to do so much work to get things working.

> if you think it should be fixed why stop at usernames? why represent uids with 1234 instead of 一二三四?

I mean, if computers had first been built in South East Asia, they would have been.


it's certainly hard to localize everything but billions of people use ios/android in India, China, SEA, MENA, etc... i think it's fair to say that at the end user level, computers are in fact usable by non-English speakers.

individual apps may not be as usable, but that's on the developers. good counter-example: a lot of japanese games, even ones made within the past 5 years, require setting the Windows system locale to Japanese to function properly. and as someone who played a fair number of japanese doujin games in the 00s/10s, it used to be that every game had this problem.

> I mean, if the computers had first been built in south east asia, they would have been.

debatable as CJK heavily use Arabic numerals everywhere, but even if they did, so what? you'd learn those symbols and get used to it. the same way that if you're a unix sysadmin you get used to only being able to use a small subset of ASCII characters for usernames.


> it's certainly hard to localize everything but billions of people use ios/android in India, China, SEA, MENA, etc... i think it's fair to say that at the end user level, computers are in fact usable by non-English speakers.

It's important to contextualize these discussions in socioeconomics. Computers are not just fun playthings. They are serious tools used for economic activity. Their usage, through their design, has a significant impact on the social systems of a society. Non-Latin-language speakers are able to use poorly localized computers, but only less well than Latin-language speakers. At least in South Asia, there is a huge economic divide between those who can speak English and those who can't, where causality runs both ways, and in more recent times it is exacerbated by the inability of some to use technology. And that economic divide then causes huge sociopolitical problems in societies.

If computers are a means of economic progress, we shouldn't put up the condition that one has to somehow learn English to use them well. But isn't localization sufficient? No, it isn't. Set aside even the fact that localization requires some members of your language community to be bilingual. The current era of economic progress is characterized by software development. But if the only way you can develop software is to learn a foreign language, then surely we are denying economic progress to some communities.

P.S. I will repeat. Nobody has to do any work to help other communities. But to assert that such work should not happen is plain wrong.


you're confusing "speaking English" with "knowing the English alphabet." these things are orthogonal. 95%+ of people in those countries know the english alphabet. i just threw down google maps street view at a random spot in Phnom Penh and instantly found english letters visible from the street, on advertisements[0]. then i threw it down in a much smaller Thai city that i had never heard of, Nakhon Sawan, and instantly found English on the street.[1] i've been in China, Japan and Korea enough to know english characters are all over the place. the English alphabet is omnipresent; i think you fail to realize this. nobody who is using a computer in these places is getting confused by the english alphabet.

> But to assert that such work should not happen is plain wrong.

i assert it should not happen because it's not solving an actual problem, the same way that changing "x" and "y" to "ㅋ" and "ㅌ" in algebra doesn't solve a problem, and trying to "solve" it will yield a monstrous amount of incompatibilities and confusion. here's a really good comparison: IPv6. IPv6 is solving a problem, maybe in a way people disagree with, but definitely a real problem... and yet we still can't make IPv6 fucking work after God knows how many years, and trying to get IPv6 networking at any sort of scale is a massive fucking headache. now we want to go through the same headaches to support... umlauts in usernames? yeah, no thanks.

there's enough real work left to be done in the world that we shouldn't waste time with stupid makework like this.

or maybe in 30 years i'll be able to call up IT support and say "hey i forgot my password, can you reset it? my username is 神王 سعود. ... need me to spell that for you?"

edit: somewhat ironically, HN swallowed a few of the unicode characters in my theoretical future username...

[0] https://i.imgur.com/0WkG0ze.png

[1] https://i.imgur.com/VhDR5Xh.png


I am from Pakistan. At least in South Asia, there are english characters everywhere because the infrastructure is primarily designed for the rich english-speaking classes, while the poor are left behind. A serious political problem.

I have seen many non-english speaking people interact with computers in English, both poor people and old folks in rich families who don't know English. They kinda recognize the shape of words, or they go by icons. They don't actually know the meaning of anything. They can only do a limited set of pre-memorized actions. Scamming them is easy. If they get stuck, they need to beg someone to help them.

Again, I will say this. There are two problems here. One for users and one for developers. Users must be able to read in their own language. Developers must be able to develop in their own language.


> They kinda recognize the shape of words, or they go by icons. They don't actually know the meaning of anything.

That's kind of true of a lot of English computer users too.

But more to the point, what you are advocating for is translating the interface. Which I think nobody is against, and which is a common thing to do (at least for countries people care about, which sadly excludes a lot of the poorer parts of the world). The username prompt should read "username" in Urdu. That doesn't automatically mean it has to accept non-ASCII input too, as long as you accept Unicode in the display name.

> Developers must be able to develop in their own language.

I learned coding in Pascal before I learned that "if" is an English word. English helps, but in the end keywords in programming languages and shell commands are only mnemonics. Knowing the translation helps but isn't necessary. What's important are documentation, tutorials and other resources in a language the developer understands.


I have an impression that people confuse learning English (which is hard unless your native language is a Germanic/Romance one) with learning to recognize and type Latin characters, which is easy; people around the world already use the Latin alphabet without knowing any English. You may escape the Latin alphabet if you have spent your whole life in a remote village, but for people living in cities around the world it should be familiar and not a barrier at all. It's hard to escape Latin characters in the modern world and this ship has already sailed, like it or not (I mostly do).


Oh no please, I don’t want to have my linux username in Cyrillic. Thanks but no, thanks!

I know enough linux to see 10 ways in which it will make things worse at some point.


> similar (or even the same, depending on font) characters which can be used in malicious ways

These are called "confusables" and boy does that well run deep: https://www.unicode.org/Public/security/16.0.0/confusables.t...
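
A tiny Python illustration of how deep the well runs, using two strings that render identically in many fonts:

  import unicodedata

  a = "paypal"            # all Latin letters
  b = "p\u0430yp\u0430l"  # Cyrillic U+0430 in place of both 'a's

  print(a == b)                      # False
  print(unicodedata.name("\u0430"))  # CYRILLIC SMALL LETTER A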


> It is time for POSIX to get with the times.

"Be the change that you wish to see in the world." — Mahatma Gandhi

It's free to join:

* https://www.opengroup.org/austin/lists.html

* https://www.opengroup.org/austin/


[flagged]


> Most useless post so far

Is GP willing to help out? To go through data structures and file formats (like pax[0][1] (née ustar (née tar))) to find places where things will need to be changed?

It's easy to say "Someone else should fix things."

[0] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/p...

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/p...


NO. PLEASE DON'T. This wreaks havoc especially on East Asian users because Unicode is poorly supported in console on top of being binary non-canonical in both entry and display.

Meaning,

  - :potato: OR :potatoh: may display as :eggplant: OR :potato:    
  - isEqual(`:eggplant:`, `:eggplant:`) may fail OR succeed   
  - trying to type :sequence: breaks console until reboot  
  - typing :potato: may work but not :eggplant:  
  - users don't know how to spell :eggplant:  
  - etc.

If you must, please fix Unicode first so that user entry and display have a 1:1 relationship. I do have Han Unification in mind, but I believe the problem isn't unique to the unification or East Asia.
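
The entry/display mismatch is easy to demonstrate; e.g. in Japanese, the same rendered kana can arrive as one code point or two (a Python sketch using the standard unicodedata module):

  import unicodedata

  ga_composed = "\u304c"          # が as a single code point
  ga_decomposed = "\u304b\u3099"  # か + combining dakuten, renders the same

  print(ga_composed == ga_decomposed)  # False: a naive isEqual fails
  print(unicodedata.normalize("NFC", ga_decomposed) == ga_composed)  # True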


Almost nobody supports string search and comparison API functions for Unicode. The Unicode security tables for Unicode identifiers are hopelessly broken.

Not even the simplest tools, like grep, support Unicode yet. This didn't happen in the last 15 years, even though there are patches and libs.


Wasn't one way to make grep faster setting LANG=C to avoid using language-aware string comparison? If so, shouldn't Unicode be supported by default or what would, say, de_DE.UTF-8 actually compare to make it slower?
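
Roughly, under LANG=C collation degenerates to comparing code points, while a language locale applies real collation rules. A small Python sketch, assuming a de_DE.UTF-8 locale is actually installed on the system:

  import locale

  locale.setlocale(locale.LC_COLLATE, "C")
  print(locale.strcoll("\u00e4", "b") > 0)  # True: 'ä' sorts after 'b' by code point

  locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")  # assumes locale exists
  print(locale.strcoll("\u00e4", "b") < 0)  # True: German sorts 'ä' with 'a'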


yes it should. but the libunistring variant was too slow. And since LANG is run-time evaluated you cannot really provide pre-compiled, better search patterns.

sometime I'll come up with pre-computed optimized tables, but no time.


It's just a grep bug, ripgrep is fast and supports proper regex.


Sure, go ahead. Write the PR and make sure to test against all other things used in production.

Let's talk again in 30 years when you're done.


Oh, it's been closer to 20 years for the rest of the world to catch up to Unicode than 30. We aren't at "perfect" now, but we're certainly down to the trickier corner cases where it is difficult to even see how you would solve the problems at all, let alone code the solutions, and that's just reality's ugly nose sticking into our pristine world of numbers.

But there really isn't any other solution. Yes, there will be an uncomfortable transition. Yes, it blows. But there isn't any other solution that is going to work other than deal with it and take the hits as they come. The software needs to be updated. The presumption that usernames are from some 7-bit ASCII subset is simply unreasonable. We'll be chasing bugs with these features for years. But that's not some sort of optional aspect that we can somehow work around. It's just what is coming down the pike. Better to grasp the nettle firmly [1] than shy away from it.

At least this transition can learn a lot from previous transitions, e.g., I would mandate something like NFKC normalization applied at the operating system level on the way in for API calls: https://en.wikipedia.org/wiki/Unicode_equivalence Unicode case folding decisions can also be made at that point. The point here not being these specific suggestions per se, but that previous efforts have already created a world where I can reference these problems and solutions with specific existing terminology and standards, rather than being the bleeding-edge code that is figuring this all out for the first time.

[1]: https://www.phrases.org.uk/meanings/grasp-the-nettle.html
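
A minimal sketch of what that boundary normalization could look like, in Python (the function is hypothetical; NFKC and default case folding are the real Unicode operations named above):

  import unicodedata

  def canonicalize(name: str) -> str:
      # Hypothetical API-boundary step: fold compatibility forms
      # (full-width letters, ligatures), then apply default case folding.
      return unicodedata.normalize("NFKC", name).casefold()

  print(canonicalize("\uff2b\uff25\uff36\uff29\uff2e"))  # 'kevin' (full-width KEVIN)
  print(canonicalize("KEVIN"))                           # 'kevin'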


Don't get me wrong, I think using UTF-8 everywhere is how things should be.

But this is not a "let's just" or "why don't we" type of endeavor. This is a major undertaking, and as such people are needed who (A) think it is worth the effort and (B) are willing to follow through with all the consequences.

Open Source software lives from contributions and if you're not willing to do it, why should others spend years of their lives for it?

In the end this is a question of: are the benefits worth the effort? What do we win? Where do things get simpler? Where more complicated? How do you pull it off if half the distributions use UTF-8 and the other half uses the legacy way? How would tooling deal with this split? etc.


To add a little bit of context:

You know what I think would be way worse than today's reduced-character-set usernames with some special rules, or "just" using UTF-8 for them?

Both. Imagine a world where some usernames are UTF-8, some are not, and it is hard to figure out which is which. That would be worse than just leaving things as they are.

Avoiding that situation makes pulling the whole thing off even harder, since there needs to be a high amount of coordination between many projects, distros etc.


> Unicode case folding decisions can also be made at that point

Ok, I will bite. How do you intend to do case folding without knowing the language the string is in? Will every filename or whatever also have its language as part of the string? I am not sure what the plan is there.
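
The classic counterexample is Turkish; a Python sketch of why language-blind case folding can't be right for everyone:

  # Unicode default (language-independent) case mapping:
  print("I".lower())       # 'i'  -- wrong for Turkish, which expects 'ı'
  print("\u0130".lower())  # 'i̇'  -- İ becomes 'i' + U+0307 combining dot

  # Turkish needs I -> ı (U+0131) and İ -> i, but nothing in the string
  # itself says it is Turkish, so a plain casefold can't know.
  print("I".casefold() == "\u0131".casefold())  # False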


Unicode opens a whole can of worms. The world is already full of software which in theory supports non-ASCII text but in practice breaks for some use cases. It's easy to allow UTF-8; it's hard to test all possible use cases, and to foresee them well enough to know what to test. Nowadays I use mostly English so I don't see localization bugs, but when I used my native language with software/the internet (~10y ago) I encountered too many bugs and avoided using non-ASCII in things like usernames/passwords, file names and other places where UTF-8 may be allowed but causes problems later. Just allowing UTF-8 is rarely enough. Localization is hard, so better to start with the places where it is important. Usernames IMHO are not one of them.


Sounds like lots of work and a lot of new bugs for no real value.


> Computers are used in more than the US and Canada

Even if you speak US (or Canadian) English exclusively, there are still some words that are just impossible to spell correctly in pure ASCII, e.g. résumé, café etc.


“correctly”. I don’t consider it “incorrect” English when someone writes “cafe” or “resume”. It seems to me a little bit pædantic to insist that those words must have the accent marks in order to be correct (when using them in English).


Yeah, loanwords are different words than the original word.

The correct plural of "baby" in German is "babys".


I would say it is not the place of POSIX to prescribe how things should be; the job of POSIX is to describe what is, a common operating environment. This is why POSIX is such a mess and why I feel it is not a big deal to deviate from POSIX. However, POSIX fills an important role in getting everyone on the same page for interoperability.

In my opinion the way to improve this is bottom up, not top down. Start with Linux (these days POSIX is largely "what does Linux do?"), get a patch in that changes the definition of the user name from a subset of ASCII to a subset of UTF-8. What subset? That is a much harder problem with UTF-8 than ASCII, good luck. Get a similar patch into a few of the BSDs. Then you tell POSIX what the OSes are doing, and fight to get it included.

On the subject of which Unicode subset: perhaps the most enlightened thing to do is the same as the Unix filesystem, and punt. One neat thing about the Unix filesystem is that names are not defined in an encoding but as a sequence of bytes. This has problems and has made many people very mad, but it does mean your file system can be in whatever encoding you want; transitioning to UTF-8 was easy (mainly due to the clever backwards-compatible nature of UTF-8) and we were not locked into a problematic encoding like on Windows. Perhaps just define that the name is an array of bytes and call it a day. That sounds like the Unix way to me.
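
Python exposes exactly that byte-oriented model, which makes the "punt" approach easy to see:

  import os

  # POSIX filenames are byte strings with no declared encoding; ask with
  # bytes and you get raw bytes back.
  print(os.listdir(b"/tmp")[:3])  # e.g. [b'caf\xc3\xa9.txt', ...]

  # Ask with str and Python decodes using the locale encoding, smuggling
  # undecodable bytes through as surrogates (PEP 383).
  print(os.listdir("/tmp")[:3])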


"however posix fills an important role in getting everyone on the same page for interoperability."

Isn't that exactly what the posix username rules are doing? Specifying a set of characters which are portable across systems to allow for interoperability between current and legacy unix systems along with most non-unix systems.

"Start with linux"

Which linux? Debian/Ubuntu, Redhat/Fedora, shadow-utils, and systemd all differ.

"get a patch in that changes the defination of the user name from a subset of ascii to a subset of utf-8"

ASCII is a subset of UTF-8 so the POSIX definition already specifies a subset of UTF-8.
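
For reference, the current portable rule is small enough to write down; a Python sketch of the check (regex mine; the character set and the no-leading-hyphen rule come from the standard's definition of a portable user name):

  import re

  # Portable filename character set, plus the rule that a user name
  # must not start with a hyphen.
  POSIX_USER = re.compile(r"^[A-Za-z0-9._][A-Za-z0-9._-]*$")

  print(bool(POSIX_USER.match("kevin-f")))     # True
  print(bool(POSIX_USER.match("-kevin")))      # False: leading hyphen
  print(bool(POSIX_USER.match("s\u00f8ren")))  # False: outside the set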


Honestly, I just don't care. UTF8 is excessively complicated. ASCII is simple.



