
I'm going to split some hairs, because it matters for the topic at hand.

>Unicode of course takes up more space and fills up your buffer sooner. Looks like the jump happens after 8 chars.

It sounds like you are conflating Unicode with UTF-8. Unicode defines the code points; UTF-8 is one of several ways to encode those code points as bytes. Further, it seems like you assume that "Unicode characters" have a constant size. This is a potentially dangerous misunderstanding of how UTF-8 works. UTF-8 encodes each code point in a variable number of bytes (from one to four). You happen to have copied some code points that take 3 bytes each.
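
To make the variable-width point concrete, here is a minimal Python sketch. The sample characters and the helper name encoded_lengths are my own choices for illustration, not anything from the parent comment. It just counts how many bytes one code point needs under three Unicode encodings:

```python
# Byte length of a single code point under different Unicode encodings.
# The "-le" variants are used so no byte order mark is added to the count.

def encoded_lengths(ch):
    """Return the encoded size in bytes of one character, per encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# Illustrative samples: ASCII, Latin-1 range, a 3-byte BMP symbol, an emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

Under UTF-8 these four take 1, 2, 3 and 4 bytes respectively, while UTF-32 always takes 4. So how fast a byte buffer fills up depends on which code points you paste and which encoding is in use, not on "Unicode" as such.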

The UTF-8 encoding scheme is a great compromise, and the Wikipedia article is easy to follow: http://en.wikipedia.org/wiki/UTF-8



I also used to believe Unicode and UTF-8 were different types of encoding until someone corrected me. I just remembered why I had thought such a thing in the first place:

http://msdn.microsoft.com/en-us/library/system.text.encoding...


You and probably everyone else the first time they encounter Unicode / UTF-8. I wonder if it's because both terms start with 'U'.


Thanks for correcting my errors.



