Multilingual OCR Activities and Interest at ScanSoft
Yang HE and Ben WITTNER
ScanSoft, Inc., 9 Centennial Dr, Peabody, MA 01960 USA
At ScanSoft, Inc., formerly Xerox Imaging Systems, we develop and
market a commercial OCR software product called TextBridge. It
currently supports 56 languages, organized into six groups:
American/European, Baltic, Central European, Cyrillic, Greek, and
Turkish. The current system loads one language group at a time
and can recognize all languages in that group on the same page.
We have collected large ground truth data sets for our own development
purposes. The ground truth files are in either the Latin 1 codepage or
Unicode format. They contain some markup for certain formatting
information, but for accuracy evaluation the markup is filtered out.
We have developed two different string matchers. One is 8-bit
codepage based. It tries to align OCR output lines with ground truth
lines and counts both character and word errors. The other is Unicode
based and aligns the whole page of OCR output to the ground truth as
a single string. It can count only character errors at this point,
and we have not yet tested it with many languages. For daily
development we currently convert the Unicode ground truth to its
corresponding codepage and use the first matcher.
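
As an illustration only, the two matching styles described above can be
sketched with a plain edit-distance alignment. This is not the TextBridge
matcher; the function names (edit_distance, page_errors, to_codepage) are
hypothetical, and the sketch assumes words are separated by whitespace,
which does not hold for languages such as Chinese or Japanese.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # item missing from OCR output
                            curr[j - 1] + 1,      # extra item in OCR output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def page_errors(truth_page, ocr_page):
    """Align a whole page as one string; return (char_errors, word_errors).

    Assumption: both inputs are Unicode strings. They are normalized to NFC
    so that precomposed and combining forms of a character compare equal,
    and runs of whitespace (including line breaks) are collapsed.
    """
    truth_words = unicodedata.normalize("NFC", truth_page).split()
    ocr_words = unicodedata.normalize("NFC", ocr_page).split()
    char_errors = edit_distance(" ".join(truth_words), " ".join(ocr_words))
    word_errors = edit_distance(truth_words, ocr_words)
    return char_errors, word_errors


def to_codepage(text, codepage="latin-1"):
    """Convert Unicode ground truth to an 8-bit codepage.

    Raises UnicodeEncodeError when a character has no codepage equivalent,
    which makes lossy conversions visible instead of silent.
    """
    return text.encode(codepage)


print(edit_distance("kitten", "sitting"))  # 3
print(page_errors("the quick brown fox", "the qu1ck brown f0x"))
```

The quadratic-time alignment is adequate for a page-sized string in a
sketch like this; a production matcher would likely need a banded or
otherwise optimized alignment, and a language-aware word segmenter.
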
We are interested in all the topics listed in the workshop's
"Technical Focus". In addition, based on our experience developing
and using our data and tools, we would like to learn about other
people's opinions and work on how to handle the following specific issues:
* Issues associated with a Unicode based matcher that can check
character and word errors across all languages.
* Punctuation "normalization", i.e., how to treat as the same those
punctuation marks that have the same function but different shapes
and/or codes in different languages.
* Same words with and without diacritics. For example, "resume" and "résumé".
* Unusual text line orientations and running directions.
* Non-unique conversion between Big5 and GB codes for Traditional
and Simplified Chinese. Can the conversion be made unique via Unicode?
Are there non-unique Unicode/codepage conversions in other languages?
* Furigana (the lowered and smaller Japanese characters for