The question is, do you need to handle UTF-8? In most cases of server management, LATIN1 would suffice, or an appropriate localized 8-bit encoding.
You can apply the same argument to any other obscure encoding scheme, not just UTF-8. The answer is really to pick the encoding that will suit the most use cases and won't be painful to use.
How is ASCII or Latin-1 somehow not an obscure encoding scheme in your parlance? Latin-1 is considerably less common on the web than UTF-8, and ASCII might be too if UTF-8 weren't a superset of it.
The fact of the matter is that people have to get out of the habit of going "oh, that will never be relevant for me", because a) it's unlikely to be true for everybody (for example, in a server environment you can't even write perfectly good North American names like Étienne, let alone perfectly good older-than-the-country names like Þórr); b) it's unlikely to be true even for you, as you will discover the first time you try to read a file written by somebody else; and c) it promotes intellectual laziness that makes later internationalization efforts difficult to impossible, because those efforts amount to trying to convince people like you to care about stuff you find inconvenient but that is of vital importance to your customers.
I know of a man whose first name is Þórr who outright refuses to deal with any organization that will not let him write his name, and I have a hard time disagreeing with him.
> I know of a man whose first name is Þórr who outright refuses to deal with any organization that will not let him write his name, and I have a hard time disagreeing with him.
I wish I had that luxury sometimes. Here's a pet peeve. I live in France, where people expect everyone to have a one-word family name (it seems). It's really common for people and especially computer systems to "correct" my name (which is of the form Me van der Somebody) to something like Me VANDERSOMEBODY or Me Van Der Somebody. A particular web page (my employer's, actually -- where I have to fill in my name for my "profile page") does the latter, and refuses to allow me to fix it. I have submitted bug reports, and just get WONTFIX back. :(
It's really rude, so I can imagine that for a person called Þórr it's a nightmare everywhere except Iceland.
The original question was about the command line. When managing servers it's perfectly acceptable to use LATIN1 if one does not expect to deal with UTF-8 data (examining processes, VM states, changing configuration, etc.). If that's not the case, one can use UTF-8 or some other encoding for that matter (e.g. I routinely have to deal with non-UTF, non-ASCII filenames and text, so using UTF-8 won't cut it here either).
This is a horrible assumption. UTF-8 can show up just about anywhere. 99% of the utilities mentioned in the article support UTF-8, which means they will output UTF-8 if they think that's a good idea. And that can show up at any point: exotic filenames, error messages, fancy box-drawing or indicator characters in the output...
Don't expect Latin-1, ever. Respect specified encodings, and default to UTF-8. Never, ever think you'll be fine with Latin-1 when dealing with UTF-8 output just because you speak English.
The trick the original article describes is a hack to get better performance in very controlled situations. It should never be put in an rc file like the article recommends, because it will break things and you won't know that it's the cause.
Edit: Here are a few examples of very real potential breakage, to show you how bad an idea this is. You're debugging a Python script. You open a Python shell and start pasting some of its lines to figure out what's happening. OOPS, there's UTF-8 in there:
[4:08:48] adys@azura ~ % python -c 'print("I like the letter Š")'
I like the letter Š
[4:08:54] adys@azura ~ % LANG=C python -c 'print("I like the letter Š")'
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 25: surrogates not allowed
(The input doesn't even have to be explicit, by the way, it could even be in your filenames).
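That `'\udcc5'` in the traceback is a lone surrogate: the mechanism (PEP 383's `surrogateescape` error handler) that CPython uses to smuggle undecodable bytes through `str` objects, e.g. when reading argv or filenames under a mismatched locale. A minimal sketch of it, separate from the shell transcript above:

```python
# Sketch of the surrogateescape error handler (PEP 383), assuming
# CPython 3. This is how bytes that don't fit the locale's encoding
# survive inside str objects -- and why they blow up later.

raw = "Š".encode("utf-8")           # b'\xc5\xa0', the two UTF-8 bytes of Š

# Decoding those bytes as ASCII would normally fail; with
# surrogateescape, each undecodable byte becomes a lone U+DCxx surrogate.
smuggled = raw.decode("ascii", errors="surrogateescape")
print(repr(smuggled))               # '\udcc5\udca0'

# The original bytes survive a round trip through the same handler...
assert smuggled.encode("ascii", errors="surrogateescape") == raw

# ...but encoding the surrogates as real UTF-8 raises -- the very
# "surrogates not allowed" error seen in the transcript.
try:
    smuggled.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)
```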
Should I keep going? See how "awk '{print tolower($0)}'" behaves when it's dealing with UTF-8 input. Maybe that awk is deeply hidden in a 5000-line script. Maybe it deals with web data, filenames, anything. And suddenly you have Unicode escapes in data that doesn't expect any.
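What the awk example boils down to is byte-wise versus Unicode-aware case mapping (the exact behavior varies by awk implementation and locale, so here's the underlying difference illustrated in Python rather than awk):

```python
# A C-locale tool effectively lowercases bytes, so only A-Z change;
# a Unicode-aware lowercase knows about everything else.

text = "ÉTIENNE"

# Unicode-aware: every letter is lowered, including É.
print(text.lower())                        # étienne

# Byte-wise on the UTF-8 encoding: the ASCII letters are lowered,
# but the two bytes of É (0xC3 0x89) pass through untouched.
raw = text.encode("utf-8")
print(raw.lower().decode("utf-8"))         # Étienne
```

So the same pipeline silently produces different output depending on the locale it happens to run under.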
Now how's that one line in your rc file, which you wrote maybe months or years ago, working out for you? Are you going to know the issue is your LC_ALL variable?
There are multiple issues you are dealing with here. First, no tool "will output UTF8 if it thinks that's a good idea". All properly designed applications are required to respect the LANG setting (or the equivalent LC_XXX environment variables) and use the character encoding specified there. This is the basis of character encoding support in most (all?) modern UNIXes. There's no risk in running with LANG set to something other than en_US.UTF-8 when using properly implemented software (which is most of it), and as a matter of fact a lot of UNIX variants have C as the default (actually, the only two I know of that don't are Linux and OS X). I have been running with LANG set to C (or KOI8-R on OS X) on FreeBSD, Linux and OS X for more than 10 years, and have yet to hit the problem. Now, there are always issues with dealing with UTF-8 data (or a different one-byte encoding, to a lesser extent), but see below.
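For reference, the POSIX lookup order those LC_XXX variables follow can be sketched as a small function (a simplified model of the character-classification category only; real libcs also handle per-category variables like LC_MESSAGES):

```python
# Sketch of the POSIX precedence for the character-encoding locale
# category: LC_ALL beats LC_CTYPE, which beats LANG, with "C"
# (a.k.a. "POSIX") as the ultimate default.

def effective_lc_ctype(env):
    """Return the locale string governing character classification."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:                # an empty string counts as unset
            return value
    return "C"

print(effective_lc_ctype({"LANG": "en_US.UTF-8"}))                  # en_US.UTF-8
print(effective_lc_ctype({"LC_ALL": "C", "LANG": "en_US.UTF-8"}))   # C
print(effective_lc_ctype({}))                                       # C
```

This is why setting LC_ALL=C in an rc file overrides everything else a program might otherwise honor.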
Second, you keep treating UTF-8 as some special encoding ("one OS, one encoding"?), but it certainly isn't. Even if you are using UTF-8 as the system encoding, you are going to have problems dealing with non-UTF-8 data, and it's actually going to be worse than dealing with UTF-8 while using LATIN1. Mind you, most systems don't use UTF-8 as the default encoding: most Linux distributions do, but Windows uses UTF-16, Mac OS X has its own somewhat incompatible version of UTF-8, Java is UTF-16, and so on. You will have problems dealing with all of these systems if your encoding is set to UTF-8.
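The "somewhat incompatible version of UTF-8" presumably refers to the OS X filesystem storing filenames in a decomposed Unicode normalization form (NFD-like), while most other systems pass around precomposed (NFC) strings. A quick illustration of how the two forms differ while looking identical:

```python
import unicodedata

# The same visible name in two Unicode normalization forms: NFC uses
# the precomposed É (U+00C9); NFD uses E plus a combining acute accent.
nfc = unicodedata.normalize("NFC", "Étienne")
nfd = unicodedata.normalize("NFD", "Étienne")

print(nfc == nfd)                  # False -- byte-different, look identical
print(len(nfc), len(nfd))          # 7 8
print(nfc.encode("utf-8") == nfd.encode("utf-8"))   # False
```

So a filename created on one system can fail a naive string comparison on another even though both are "valid UTF-8".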
Lastly, the Python example is not really representative. Some language implementations have better support for multi-byte encodings, and some worse. Ruby, for example, has very good support for dealing with UTF-8 data even when not using it as the system encoding.
>I know of a man whose first name is Þórr who outright
>refuses to deal with any organization that will not let him
>write his name, and I have a hard time disagreeing with him.
How does he fly? All flights I've ever been on have required me to ASCII-fy my name.
He mentioned being somewhat forgiving when people invariably screwed it up to 'rr' or 'orr', but I think he was Norwegian. So this may be a much more common problem with accepted solutions in his area of Europe. What drove him up the wall was when he typed his name in and the web site would be all "illegal characters in name".