Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Yeah, but utf-16 is still bytes. It's just bytes with a different encoding.

But I do see the pain with Python 3 where the runtime tries to hide these kinds of issues from you. That abstraction can make it difficult to have the right behaviour.



Everything is bytes but the meaning assigned to bytes, matters. Let’s say I create a file named «Файл» on Unix in UTF8 and put it into git repo. For Unix it is a sequence of bytes that is representation of Russian letters in UTF8. So far so good. Now I clone this repo to Windows, what should happen? The file can not be restored with the name as encoded into bytes on Unix, that will be garbage (that even has a special name “Mojibake”) in the best case or fail outright in the worst. What should happen is decoding of those bytes from UTF8 (to get original Unicode code points) Into Unicode code points, then encoding using Windows native encoding (UTF-16).


True, but one of those representations still needs to be canonical one in the repo for the purposes of hashing into the commits and so on.

Git builds a bunch of logic like this in around handling line endings in text files.


Everything isn't bytes. Strings without an encoding don't have a specific byte representation.


It's the other way around. Strings always have meanings and always reference the same characters. You use encoding to encode strings into bytes.

Bytes without encoding, don't have any meaning, they are just... random bytes.


We're actually saying the same thing. You're saying without an encoding you can't turn bytes into a string (technically, in Python terminology, that's a decoding, but you know... ;-). I'm saying a string doesn't have a byte representation without an encoding. That's two perspectives on the same truth.

I absolutely agree that a string has meaning without a byte representation. That's the whole point of having it as a distinct type.


UTF-16 is not "just bytes". There are sequences of bytes that are not valid UTF-16, so if you want to roundtrip bytes through UTF-16 you have to do something smarter than just pretending the byte sequence is UTF-16.


Sorry, I wasn't trying to imply that any permutation of bytes would work. If you encode it improperly, it's not going to work.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: