
Wide characters are best avoided, even on platforms where they don't mean UTF-16. It's better to stay in UTF-8 and only verify that the input is well-formed.
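To make "only verify that it's well-formed" concrete, here is a minimal sketch of a validator that never leaves the byte domain. The function name `utf8_is_well_formed` is made up for illustration; the byte ranges follow the well-formedness rules in RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8. */
bool utf8_is_well_formed(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;                         /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the FIRST continuation byte */

        if (b <= 0x7F) { i++; continue; }    /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) need = 1;       /* 0xC0/0xC1 would be overlong */
        else if (b == 0xE0) { need = 2; lo = 0xA0; }     /* rejects overlong 3-byte forms */
        else if (b >= 0xE1 && b <= 0xEC) need = 2;
        else if (b == 0xED) { need = 2; hi = 0x9F; }     /* rejects surrogates U+D800..U+DFFF */
        else if (b >= 0xEE && b <= 0xEF) need = 2;
        else if (b == 0xF0) { need = 3; lo = 0x90; }     /* rejects overlong 4-byte forms */
        else if (b >= 0xF1 && b <= 0xF3) need = 3;
        else if (b == 0xF4) { need = 3; hi = 0x8F; }     /* rejects anything above U+10FFFF */
        else return false;                               /* stray continuation or invalid lead */

        if (len - i <= need) return false;               /* sequence truncated by end of input */
        if (buf[i + 1] < lo || buf[i + 1] > hi) return false;
        for (size_t k = 2; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80) return false;
        i += need + 1;
    }
    return true;
}
```

Note that nothing here produces a code point; it only answers yes or no, which is the whole point of staying in UTF-8.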


But at some point you'll want to know whether the code point you just read satisfies `iswalpha()` or whatever, so you'll have to decode UTF-8 anyway.
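Roughly what that decode-then-classify step looks like, as a sketch. `utf8_decode` and `cp_is_alpha` are hypothetical names; the decoder assumes the input was already checked for well-formedness, and the `iswalpha()` call assumes `wchar_t` holds Unicode code points (true where `__STDC_ISO_10646__` is defined, not true for astral characters with a 16-bit `wchar_t`) and that a suitable locale has been set with `setlocale()`.

```c
#include <stddef.h>
#include <stdint.h>
#include <wctype.h>

/* Hypothetical helper: decodes one code point from s, which is ASSUMED to be
   well-formed UTF-8. Stores it in *cp and returns the number of bytes used. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {                       /* 1 byte: 0xxxxxxx */
        *cp = s[0];
        return 1;
    }
    if (s[0] < 0xE0) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if (s[0] < 0xF0) {                       /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            |  (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    *cp = ((uint32_t)(s[0] & 0x07) << 18)
        | ((uint32_t)(s[1] & 0x3F) << 12)
        | ((uint32_t)(s[2] & 0x3F) << 6)
        |  (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Assumption: wchar_t values are Unicode code points and the locale knows
   about them. In the default "C" locale only ASCII letters are reported. */
int cp_is_alpha(uint32_t cp)
{
    return iswalpha((wint_t)cp) != 0;
}
```

So the wide-character API only gets touched at the single point where classification is needed, and the rest of the pipeline can keep passing bytes around.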


At the parser level, though, not down in the lexer. I intern unique user-defined strings (just with a hashcons or whatever the cool kids call it these days). That defers deciding whether the UTF-8 is correct to "someone else".
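For anyone who hasn't seen the interning trick, a minimal sketch of what is meant, not the commenter's actual code. `intern`, `intern_table` and `INTERN_BUCKETS` are made-up names, the hash is plain 32-bit FNV-1a, and the table is a fixed-size chained hash that never frees anything. The strings are treated as opaque bytes, so no UTF-8 decoding happens here at all.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define INTERN_BUCKETS 1024   /* arbitrary size chosen for the sketch */

struct intern_entry {
    struct intern_entry *next;
    size_t len;
    char bytes[];             /* len bytes followed by a NUL */
};

static struct intern_entry *intern_table[INTERN_BUCKETS];

/* 32-bit FNV-1a over raw bytes. */
static uint32_t fnv1a(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns the canonical copy of the byte string s[0..len).
   Two calls with equal bytes return the same pointer, so later stages can
   compare identifiers with ==. Returns NULL if allocation fails. */
const char *intern(const char *s, size_t len)
{
    struct intern_entry **slot = &intern_table[fnv1a(s, len) % INTERN_BUCKETS];

    for (struct intern_entry *e = *slot; e != NULL; e = e->next)
        if (e->len == len && memcmp(e->bytes, s, len) == 0)
            return e->bytes;              /* already interned */

    struct intern_entry *e = malloc(sizeof *e + len + 1);
    if (e == NULL)
        return NULL;
    e->len = len;
    memcpy(e->bytes, s, len);
    e->bytes[len] = '\0';
    e->next = *slot;                      /* push onto the bucket's chain */
    *slot = e;
    return e->bytes;
}
```

Because equality is byte equality, two different encodings of the "same" text (say, precomposed versus combining accents) intern as different strings, which is exactly the decision being pushed off to someone else.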


Figuring out whether a character should become part of a number or a name, for instance, is typical lexer stuff though. For that you have to classify it.
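As a sketch of the kind of classification meant here: the lexer looks at the first code point of a token and decides whether it starts a number, a name, or something else. The enum and the function `classify` are hypothetical, the rule "names start with a letter or underscore" is just an example grammar, and the non-ASCII branch assumes `wchar_t` holds Unicode code points and that the locale has been set up accordingly.

```c
#include <stdint.h>
#include <wctype.h>

enum char_class { CC_DIGIT, CC_NAME, CC_SPACE, CC_OTHER };

/* Hypothetical lexer helper: classifies one already-decoded code point. */
enum char_class classify(uint32_t cp)
{
    if (cp < 0x80) {
        /* ASCII fast path: no locale or wide-character machinery needed. */
        if (cp >= '0' && cp <= '9')
            return CC_DIGIT;
        if ((cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') || cp == '_')
            return CC_NAME;
        if (cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r')
            return CC_SPACE;
        return CC_OTHER;
    }

    /* Non-ASCII: defer to the C library. Assumption: the cast is lossless,
       which does not hold above U+FFFF on platforms with a 16-bit wchar_t. */
    if (iswalpha((wint_t)cp))
        return CC_NAME;
    if (iswspace((wint_t)cp))
        return CC_SPACE;
    return CC_OTHER;
}
```

Only the second half needs a decoded code point; a lexer for a language with ASCII-only syntax could stop at the first half and treat every byte >= 0x80 as part of a name.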



