This sounds reasonable. When the GoAWK creator tried to add Unicode support thro...

benhoyt · on Aug 21, 2022

Yes, that's right. With my simplistic UTF-8-based implementation it turned length() -- for example -- from O(1) to O(N), turning O(N) algorithms which use length() into O(N^2). See this issue: https://github.com/benhoyt/goawk/issues/93

Similar with substr() and other string functions, which when operating as bytes are O(1), but become O(N) when trying to count the number of codepoints as UTF-8.

GNU Gawk has a fancier approach, which stores strings as UTF-8 as long as it can, but converts to UTF-32 if it needs to (eg: the string is non-ASCII and you call substr).

It looks like Brian Kernighan's code has the same issue with length() and substr(). I'm going to try to email him about this, as I think it's kind of a performance blocker.

benhoyt · on Aug 21, 2022

Brian sent me a lovely (and surprisingly lengthy) reply. He noted that length() was already O(N), as onetrueawk just represents strings as NUL-terminated C strings and uses strlen() (though strlen is a lot faster than decoding UTF-8). However, he said that substr() and index() have indeed gone from O(1) to O(N), as it used to be possible to just index by number of bytes.

The most challenging part, he said, was updating the regex engine, which "was written by Al Aho in 1977 and hasn't really been touched much since".