This is a horrible assumption. UTF-8 can show up just about anywhere. 99% of the utilities mentioned in the article support UTF-8, which means they will output UTF-8 if they think that's a good idea. And that can show up at any point: exotic filenames, error messages, fancy box-drawing or indicator characters in the output...
Don't expect Latin-1, ever. Respect specified encodings, and default to UTF-8. Never, ever think you'll be fine with Latin-1 when dealing with UTF-8 output just because you speak English.
The trick the original article describes is a hack to get better performance in very controlled situations. It should never be put in an rc file like it recommends, because it will break and you won't know that's what is causing it.
Edit: Here are a few examples of very real potential breakage, to show you how bad an idea this is. You're debugging a Python script. You open a Python shell and start pasting some of its lines to figure out what's happening. OOPS, there's UTF-8 in there.
[4:08:48] adys@azura ~ % python -c 'print("I like the letter Š")'
I like the letter Š
[4:08:54] adys@azura ~ % LANG=C python -c 'print("I like the letter Š")'
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 25: surrogates not allowed
(The input doesn't even have to be explicit, by the way, it could even be in your filenames).
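The '\udcc5' in that error can be reproduced in isolation. This is only a sketch, and it rests on an assumption the transcript doesn't state: that under LANG=C Python decoded the command line as ASCII with the "surrogateescape" error handler. Variable names are made up for illustration, and newer Python versions may treat the C locale differently, so the transcript itself may not reproduce verbatim.

```python
# The two UTF-8 bytes of "Š" (U+0160).
raw = "Š".encode("utf-8")

# Assumption: under LANG=C the bytes are decoded as ASCII with the
# "surrogateescape" handler, which maps each undecodable byte 0xNN
# to the lone surrogate U+DCNN instead of failing outright.
decoded = raw.decode("ascii", errors="surrogateescape")

# Lone surrogates cannot be encoded as strict UTF-8, hence the error.
try:
    decoded.encode("utf-8")
    reason = None
except UnicodeEncodeError as exc:
    reason = exc.reason

print(repr(decoded))
print(reason)

# The original bytes only come back if the same handler is used again.
roundtrip = decoded.encode("ascii", errors="surrogateescape")
```

So the text isn't lost, but anything that tries to treat it as ordinary UTF-8 along the way blows up.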
Should I keep going? See how "awk '{print tolower($0)}'" behaves when it's dealing with UTF-8 input. Maybe that awk is deeply hidden in a 5000-line script. Maybe it deals with web data, filenames, anything. And suddenly you have Unicode escapes in data that doesn't expect any.
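A rough Python analogy for the tolower case (not awk itself; what awk actually does depends on the implementation and the locale). It assumes a byte-oriented tool only knows how to lowercase ASCII letters, and compares that with character-aware lowercasing. The sample string is made up.

```python
text = "ŠKODA"

# Byte-oriented view: bytes.lower() only touches ASCII A-Z, so the two
# bytes of "Š" pass through unchanged while the ASCII letters change.
bytewise = text.encode("utf-8").lower().decode("utf-8")

# Character-aware view: str.lower() knows that "Š" lowercases to "š".
charwise = text.lower()

print(bytewise)  # Škoda
print(charwise)  # škoda
```

Same input, same operation, different result depending on whether the tool thinks in bytes or in characters.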
Now how's that one line in your rc file, which you wrote maybe months or years ago, working out for you? Are you going to know the issue is your LC_ALL variable?
There are multiple issues you are dealing with here. First, no tools "will output UTF8 if they think that's a good idea". All properly designed applications are required to respect the LANG setting (or the equivalent LC_XXX environment variables) and use the character encoding specified in them. This is the basis of character encoding support in most (all?) modern UNIXes. There's no risk in running with LANG set to something other than en_US.UTF-8 when using properly implemented software (most of it), and as a matter of fact a lot of UNIX variants have C as the default (actually, the only two I know of which don't are Linux and OS X). I have been running with LANG set to C (or KOI8-R on OS X) on FreeBSD, Linux and OS X for more than 10 years, and have yet to hit a problem. Now, there are always issues with dealing with UTF-8 data (or, to a lesser extent, a different 1-byte encoding), but see below.
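To make the lookup order concrete, here is a small sketch of how a program ends up with the locale that governs character encoding. The precedence (LC_ALL, then LC_CTYPE, then LANG, falling back to C) is the POSIX order as commonly described; the function names are hypothetical, and a real program would simply call setlocale() rather than parse the environment itself.

```python
def effective_ctype_locale(env):
    """Return the locale name that decides character encoding."""
    # Assumed POSIX precedence; an empty value counts as unset.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"


def codeset(locale_name):
    """Extract the codeset from language[_territory][.codeset][@modifier]."""
    base = locale_name.split("@", 1)[0]
    if "." not in base:
        return None  # e.g. "C" or "POSIX": no explicit codeset
    return base.split(".", 1)[1]


print(codeset(effective_ctype_locale({"LANG": "en_US.UTF-8"})))
print(codeset(effective_ctype_locale({"LANG": "en_US.UTF-8", "LC_ALL": "C"})))
print(codeset(effective_ctype_locale({"LC_CTYPE": "ru_RU.KOI8-R", "LANG": "C"})))
```

The second call is the rc-file scenario: LC_ALL=C wins over whatever LANG says.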
Second, you keep treating UTF-8 as some special encoding (ein OS, ein encoding?), but it's certainly not. Even if you are using UTF-8 as a system encoding, you are going to have problems dealing with non-UTF-8 data, and it's actually going to be worse than dealing with UTF-8 when using Latin-1. Mind you, most systems don't use UTF-8 as the default encoding: most Linux distributions do, but Windows uses UTF-16, Mac OS X has its own somewhat incompatible version of UTF-8, Java is UTF-16, and so on. One will have problems dealing with all these systems if one's encoding is set to UTF-8.
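Presumably the OS X remark refers to filenames being stored in a decomposed form (close to NFD). That reading is an assumption, not something stated above. Decomposed text is still valid UTF-8, but it differs byte for byte from the composed form most other systems produce. A sketch using Python's unicodedata module:

```python
import unicodedata

composed = unicodedata.normalize("NFC", "Š")    # single code point U+0160
decomposed = unicodedata.normalize("NFD", "Š")  # "S" + combining caron U+030C

# Both are valid UTF-8 and render the same, but the bytes differ,
# so a naive string or filename comparison treats them as different.
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))
print(composed == decomposed)
```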
Lastly, the Python example is not really representative. Some language implementations have better support for multi-byte encodings, and some worse. Ruby, for example, has very good support for dealing with UTF-8 data even when not using it as a system encoding.