I was an early adopter of Mercurial and the teams insistence that file names wer...

pdonis · on Jan 13, 2020

> the teams insistence that file names were byte strings was the cause of lots of bugs when it came to Unicode support

File names are a different problem because Windows and Unix treat them differently: Unix treats them as bytes and Windows treats them as Unicode. So there is no single data model that will work for any language.

hsivonen · on Jan 14, 2020

The Rust standard library has a solution for this that actually works: On Unix-like systems file paths are sequences of bytes and most of the time the bytes are UTF-8. On Windows, they are WTF-8, so the API users sees a sequence of bytes and most of the time they match UTF-8.

This means that there's more overhead on Windows, but it's much better to normalize what the application programmer sees across POSIX and NT while still roundtripping all paths for both than to make the code unit size difference the application programmer's problem like the C++ file system API does.

pdonis · on Jan 14, 2020

> On Windows, they are WTF-8

Seems like an apt acronym for Windows... :-)

On a more serious note, Python seems to have done something fairly similar with the pathlib standard library module.

simias · on Jan 13, 2020

Not to mention case-sensitivity issues. Can you have two files, one named "FILE.txt" and the other "file.txt" in the same directory for instance?

SSLy · on Jan 13, 2020

On windows? Of course you can.

edgyquant · on Jan 14, 2020

I'm certain you can on Linux as well. Only Macs old HFS would not allow it.

cataflam · on Jan 14, 2020

Isn't this a fairly recent change?

amaranth · on Jan 14, 2020

NTFS has always been case sensitive, Windows API just lets you treat it as case insensitive. If you pass `FILE_FLAG_POSIX_SEMANTICS` to `CreateFile` you can make files that differ only in case.

mathw · on Jan 14, 2020

Good luck using those in some tools which use the API differently though. Windows filenames are endless fun. What's the maximum length of the absolute path of a file? Why, that depends on which API you're using to access it!

rurban · on Jan 15, 2020

Even worse on Unix where it depends on the mount type. Haven't seen much proper long filename support in Unix apps or libs, it's much better in Windows land. Garbage in garbage out is also a security nightmare as names are not identifiable anymore. You can easily spoof such names.

gpderetta · on Jan 13, 2020

Hum, any program that doesn't treat filenames as bytestreams on unix is broken. Doubly so if its primary purpose is preserving and archiving files.

Are you sure the issue wasn't something else?

lmm · on Jan 14, 2020

The point is that filenames aren't bytestreams on windows, and if you treat them as such then your program won't work.

WorldMaker · on Jan 14, 2020

By this point, any cross-platform file tool that isn't using Unicode as a lowest-common denominator for filenames and similar things to insure maximal compatibility is likely ready to cause havoc.

(The remarks in the post here that Mercurial on Python 3 on Windows is not yet stable and showing a lot of issues is possibly even an indicator/canary here. To my understanding, Python 2 Windows used to paper over some of these lowest common denominator encoding compatibility issues with a lot more handholding than they do with the Python 3 Unicode assumption.)

lmm · on Jan 14, 2020

> By this point, any cross-platform file tool that isn't using Unicode as a lowest-common denominator for filenames and similar things to insure maximal compatibility is likely ready to cause havoc.

Be that as it may, Mercurial has existing repositories that may use non-unicode filenames, and just crashing whenever you try to operate on them is probably not an acceptable way forward.

WorldMaker · on Jan 14, 2020

Sure, but that's also not the only resulting option; instead of erroring you could also do something nice like help those users migrate to cleaner Unicode encodings of their filenames by asking them to correct mistakes or provide information about the original encoding. It takes more code to do that than just throwing an error, of course, but who knows how many users that might help that don't even realize why their repositories don't work correctly on, say, Windows.

Dylan16807 · on Jan 14, 2020

Windows filenames basically are bytestreams. But the bytes come in pairs.

lmm · on Jan 15, 2020

Not really. Certain byte sequences are invalid.

Dylan16807 · on Jan 15, 2020

Certain byte sequences are invalid in unix filenames too. So that can't be the factor that decides if they are bytestreams or not.

xorcist · on Jan 13, 2020

If hg borked on non-ascii characters, it sounds like the problem was rather that it didn't treat that data as a bag-of-bytes. Not the other way around?

ploxiln · on Jan 13, 2020

He was trying to use Windows. For Windows, you pretty much have to go through unicode to utf-16, can't be arbitrary bytes, can't be utf8.

(I think that relatively recently it is possible to use utf8 with some new windows interfaces ... but this is probably not widely compatible with older windows releases ...)

Dylan16807 · on Jan 14, 2020

Windows uses arbitrary shorts that are sort of supposed to be utf-16. Just like Unix uses arbitrary bytes that are sort of supposed to be utf-8.

You have to convert between them, but neither uses proper Unicode to represent filenames.

cbsmith · on Jan 13, 2020

Yeah, but utf-16 is still bytes. It's just bytes with a different encoding.

But I do see the pain with Python 3 where the runtime tries to hide these kinds of issues from you. That abstraction can make it difficult to have the right behaviour.

mynegation · on Jan 14, 2020

Everything is bytes but the meaning assigned to bytes, matters. Let’s say I create a file named «Файл» on Unix in UTF8 and put it into git repo. For Unix it is a sequence of bytes that is representation of Russian letters in UTF8. So far so good. Now I clone this repo to Windows, what should happen? The file can not be restored with the name as encoded into bytes on Unix, that will be garbage (that even has a special name “Mojibake”) in the best case or fail outright in the worst. What should happen is decoding of those bytes from UTF8 (to get original Unicode code points) Into Unicode code points, then encoding using Windows native encoding (UTF-16).

mikepurvis · on Jan 14, 2020

True, but one of those representations still needs to be canonical one in the repo for the purposes of hashing into the commits and so on.

Git builds a bunch of logic like this in around handling line endings in text files.

cbsmith · on Jan 14, 2020

Everything isn't bytes. Strings without an encoding don't have a specific byte representation.

takeda · on Jan 14, 2020

It's the other way around. Strings always have meanings and always reference the same characters. You use encoding to encode strings into bytes.

Bytes without encoding, don't have any meaning, they are just... random bytes.

cbsmith · on Jan 15, 2020

We're actually saying the same thing. You're saying without an encoding you can't turn bytes into a string (technically, in Python terminology, that's a decoding, but you know... ;-). I'm saying a string doesn't have a byte representation without an encoding. That's two perspectives on the same truth.

I absolutely agree that a string has meaning without a byte representation. That's the whole point of having it as a distinct type.

lmm · on Jan 14, 2020

UTF-16 is not "just bytes". There are sequences of bytes that are not valid UTF-16, so if you want to roundtrip bytes through UTF-16 you have to do something smarter than just pretending the byte sequence is UTF-16.

cbsmith · on Jan 14, 2020

Sorry, I wasn't trying to imply that any permutation of bytes would work. If you encode it improperly, it's not going to work.