> if one does not expect to deal with UTF-8 data This is a horrible assumption. ...

stass · on June 16, 2015

There are multiple issues you are dealing with here. First, no tools "will output UTF8 if they think that's a good idea". All properly designed applications are required to respect the LANG setting (or the equivalent LC_XXX environment variables) and use the character encoding specified there. This is the basis of character encoding support in most (all?) modern UNIXes. There's no risk in running with LANG set to something other than en_US.UTF-8 when using properly implemented software (most of it), and as a matter of fact a lot of UNIX variants have C as the default (actually, the only two I know of that don't are Linux and OS X). I have been running with LANG set to C (or KOI8-R on OS X) on FreeBSD, Linux and OS X for more than 10 years, and have yet to hit the problem. Now, there are always issues with dealing with UTF-8 data (or, to a lesser extent, with data in a different single-byte encoding), but see below.
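To make the LANG/LC_XXX point concrete, here is a rough Python sketch of how the encoding falls out of the environment. It only illustrates the POSIX precedence (LC_ALL, then LC_CTYPE, then LANG); the function name `codeset_from_env` is made up for this example. A real program would normally call `setlocale(LC_ALL, "")` and ask the C library for the codeset rather than parse the variables itself.

```python
def codeset_from_env(env):
    """Return the codeset named by the locale variables in env, or None.

    Hypothetical helper for illustration only, not a library API.
    """
    # For the character-type category, POSIX checks LC_ALL first,
    # then LC_CTYPE, then LANG. Unset or empty variables are skipped.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            break
    else:
        return None  # nothing set: the implementation default applies

    # Locale names look like language_TERRITORY.codeset@modifier.
    value = value.split("@", 1)[0]
    if "." not in value:
        return None  # e.g. "C" or "POSIX": no explicit codeset
    return value.split(".", 1)[1]


if __name__ == "__main__":
    import os
    print(codeset_from_env(os.environ))
```

With `LANG=en_US.UTF-8` and `LC_ALL=ru_RU.KOI8-R` both set, the sketch picks KOI8-R, which is the encoding a well-behaved tool would then use for its output.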

Second, you keep treating UTF-8 as some special encoding (ein OS, ein encoding?), but it's certainly not. Even if you are using UTF-8 as the system encoding, you are going to have problems dealing with non-UTF-8 data, and it's actually going to be worse than dealing with UTF-8 data when the system encoding is LATIN1. Mind you, most systems don't use UTF-8 as the default encoding: most Linux distributions do, but Windows uses UTF-16, Mac OS X has its own somewhat incompatible version of UTF-8, Java strings are UTF-16, and so on. One will have problems dealing with all these systems if one's encoding is set to UTF-8.
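One way to read the "worse" claim, sketched in Python: every byte sequence is valid LATIN1, so UTF-8 data read as LATIN1 turns into mojibake but survives a round trip, whereas LATIN1 data read as UTF-8 is often simply invalid. The helper `try_decode` is a name invented for this example.

```python
def try_decode(data, encoding):
    """Decode bytes, returning None instead of raising on invalid input."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'
utf8_bytes = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'

# LATIN1 data under a UTF-8 system encoding: 0xE9 starts a multi-byte
# sequence that never completes, so strict decoding fails outright.
as_utf8 = try_decode(latin1_bytes, "utf-8")

# UTF-8 data under a LATIN1 system encoding: the text looks wrong,
# but no byte is rejected and re-encoding restores the original bytes.
as_latin1 = try_decode(utf8_bytes, "latin-1")
round_trip = as_latin1.encode("latin-1")

if __name__ == "__main__":
    print(as_utf8, repr(as_latin1), round_trip == utf8_bytes)
```

This only covers strict decoding; a program that passes bytes through untouched, or decodes with a replacement error handler, would behave differently.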

Lastly, the Python example is not really representative. Some language implementations have better support for multi-byte encodings, and some worse. Ruby, for example, has very good support for dealing with UTF-8 data even when not using it as the system encoding.