
I did something similar about a decade ago because I was using tesseract to OCR Chinese.

Part of the problem is that English text recognized with Tesseract is much easier to clean up afterwards: when it makes a mistake, it's usually in only a single character, so you can use Levenshtein distance against a dictionary to spellcheck and fix the output, which helps a lot with accuracy.
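A minimal sketch of that Levenshtein-based cleanup, assuming a small in-memory dictionary (the word lists and function names here are illustrative, not from any particular tool):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word: str, dictionary: list[str]) -> str:
    """Replace an OCR'd word with its nearest dictionary entry."""
    return min(dictionary, key=lambda w: levenshtein(word, w))
```

A single-character misrecognition sits at edit distance 1 from the intended word, so the nearest dictionary entry is usually the right fix; for Chinese, where a whole word can be one glyph, this trick has nothing to work with.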

Logographic languages such as Chinese present a particular challenge to "conventional post-processing": many words are only two characters, and a lot of words are a single "glyph". This is particularly difficult because if the OCR gets that glyph wrong, there's no obvious way to detect the recognition error.

The solution was to use ImageMagick to "munge" the image (scale, normalize, threshold, etc.), send each of these variations to Tesseract, and then use a Markov model built from a Chinese corpus to score the statistical plausibility of each recognized sentence and vote on a winner.
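The scoring-and-voting step of that pipeline can be sketched as follows. This is a toy character-bigram model with add-one smoothing, not the original implementation; the corpus, function names, and smoothing choice are all assumptions, and the ImageMagick/Tesseract invocations that would produce the candidate strings are omitted:

```python
import math
from collections import Counter

def bigram_model(corpus: str):
    """Count character bigrams and unigrams from a reference corpus."""
    pairs = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    return pairs, unigrams

def score(sentence: str, pairs, unigrams, alpha: float = 1.0) -> float:
    """Log-probability of a sentence under the bigram model.
    Add-alpha smoothing keeps unseen bigrams from scoring -inf."""
    vocab = max(len(unigrams), 1)
    total = 0.0
    for a, b in zip(sentence, sentence[1:]):
        num = pairs[(a, b)] + alpha
        den = unigrams[a] + alpha * vocab
        total += math.log(num / den)
    return total

def vote(candidates: list[str], pairs, unigrams) -> str:
    """Each preprocessed image variant yields one OCR candidate;
    pick the one the language model finds most plausible."""
    return max(candidates, key=lambda s: score(s, pairs, unigrams))
```

In practice each candidate comes from running Tesseract on a differently preprocessed copy of the same image; a misrecognized glyph tends to produce improbable character sequences, so the corpus model votes it down.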

It made a significant improvement in accuracy.



People's handwriting varies widely, and a human reading someone's writing faces the same problems you mention. For a language like English, humans also decipher unrecognized characters by looking at which letter would fix the word, or which word would fit in the sentence, etc.

Surely the distribution of handwriting quality for Chinese is not too far off from the rest of the world. How do Chinese readers decipher text written by someone with bad handwriting?



