Performance Promise of Adaptive OCR Strategies in a Multi-lingual Environment
-----------------------------------------------------------------------------
Tin Kam HO
Bell Laboratories
Murray Hill, NJ 07974 USA
tkh@bell-labs.com

George NAGY
Rensselaer Polytechnic Institute
Troy, NY 12180 USA
nagy@ecse.rpi.edu

Adaptive OCR strategies [1][2][3] aim at training an OCR system on-the-fly
using knowledge derived from an input sequence of page images. The
resulting system is expected to be specialized to the characteristics of
symbol shape, context, and noise that are present in images from the
same source, and thus should achieve higher accuracy.
The essence of an adaptive strategy is that we first recognize some
familiar or easily identifiable shapes, and then propagate the knowledge
to more difficult shapes through internal shape consistency and contextual
constraints. This can be a recurrent process involving frequent feedback
and backtracking. Also, the process may be initiated with no known
"shapes" at all, e.g., if one is dealing with text in ancient scripts,
exotic fonts, Braille, or spoken input. It is also possible to involve
more than one form of input in the bootstrapping process; for example,
recognition may be initiated by reading a few paragraphs aloud.
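
As a rough illustration of this propagation idea, the following Python
sketch treats each word image as a tuple of shape-cluster IDs and assigns
letters to clusters. It starts from a single assumed stop word and extends
the labels wherever exactly one lexicon entry is consistent with what is
already known. The function names, the seed word "the", the toy lexicon,
and the one-cluster-per-letter assumption are all hypothetical
simplifications introduced here; a real adaptive OCR system would also have
to cope with clustering errors and alternative shapes for the same symbol.

```python
from collections import Counter


def consistent(shape, cand, labels):
    """Return True if lexicon word `cand` can explain the cluster tuple `shape`.

    Assumption (illustrative only): each cluster maps to exactly one letter
    and each letter to exactly one cluster.
    """
    if len(shape) != len(cand):
        return False
    used = set(labels.values())
    local = {}  # tentative assignments for clusters that are still unlabeled
    for cluster, ch in zip(shape, cand):
        if cluster in labels:
            if labels[cluster] != ch:
                return False
        else:
            if ch in used:  # letter already claimed by another cluster
                return False
            if local.setdefault(cluster, ch) != ch:
                return False
    # Two different new clusters must not claim the same letter.
    return len(set(local.values())) == len(local)


def bootstrap_labels(words, lexicon, seed_word="the"):
    """Propagate letter labels from one seed word to the remaining clusters."""
    # Seed: assume the most frequent word of the seed's length is the seed.
    counts = Counter(w for w in words if len(w) == len(seed_word))
    if not counts:
        return {}
    seed_shape, _ = counts.most_common(1)[0]
    labels = dict(zip(seed_shape, seed_word))

    changed = True
    while changed:  # repeat until no new cluster can be labeled
        changed = False
        for w in sorted(set(words)):
            if all(c in labels for c in w):
                continue
            candidates = [cand for cand in lexicon if consistent(w, cand, labels)]
            if len(candidates) == 1:  # adopt only unambiguous matches
                for cluster, ch in zip(w, candidates[0]):
                    if cluster not in labels:
                        labels[cluster] = ch
                        changed = True
    return labels


# Toy input: cluster IDs stand for glyph shapes whose identity is unknown.
words = [(1, 2, 3), (1, 2, 3, 4), (2, 5, 1), (1, 2, 3), (1, 3, 4), (1, 2, 3)]
lexicon = ["the", "then", "them", "hat", "ten", "and", "eat"]
labels = bootstrap_labels(words, lexicon)
# labels == {1: "t", 2: "h", 3: "e", 4: "n", 5: "a"}
```

In this toy run the four-cluster word is initially ambiguous between "then"
and "them"; it is resolved only after a shorter word fixes the label of the
last cluster, which is the kind of propagation described above.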
There are obvious advantages in building a specialized OCR for large-volume
scanning under similar conditions, such as in operations involving
multi-page documents and periodicals from the same publisher.
Moreover, such a strategy is especially valuable in an environment
where one is expected to handle documents in different languages.
A good adaptive strategy will enable us to build a generic OCR that has
a minimal core, and leave most of the sophisticated tuning to on-line
learning. This philosophy was proposed in a number of prior works [4].
There are some characteristics shared by most written languages that
can be utilized in building an adaptation strategy. However, some languages
appear to be better suited to adaptation than others.
In evaluating a language for its suitability for adaptive recognition,
we consider:
-- number of symbols with isolatable, distinctive shapes at the
character or word level:
alphabetic languages, ideographic languages
the size of the symbol set used in a language directly affects the
difficulty of shape-based clustering
-- alternative shapes for the same alphabetic symbol or group of symbols:
context-dependent forms of the same symbol in Arabic, double s in German,
ligatures such as ffi and oe in Latin script
-- recognizable / separable families of symbols:
content symbols, punctuation marks, numerals, abbreviations
-- detectable geometrical/layout conventions:
topline / baseline alignment, writing directions,
beginning-of-paragraph and in-paragraph indentation,
differences in sizes of symbols
-- sharing of some shapes used by other languages:
punctuation, numerals, diacritical marks
these are not necessarily present in some ancient forms of languages
or spoken input
-- recognizable / separable font families:
typecases, boldface, serifs, italics, mathematical, technical, or
discipline-specific symbols (e.g. chess notation)
-- strength of contextual constraints:
n-grams, size of common lexicon, skew in word frequencies (identifiable
stop words), degree of agglutination (which flattens out word frequencies)
these directly affect the feasibility of bootstrapping, i.e., propagation
of known shapes to unknown ones
-- repetitive syntactic patterns
-- recognizable defects unique to the language
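
Two of the contextual measures listed above, skew in word frequencies and
n-gram statistics, can be estimated from a plain-text sample of a language.
The sketch below shows one possible way to do so in Python. The function
names, and the choice of top-k token coverage and character n-gram entropy
as indicators, are assumptions made here for illustration, not measures
prescribed by the criteria above.

```python
import math
from collections import Counter


def top_k_coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent word types.

    A high value indicates strongly skewed word frequencies, i.e. a small
    set of identifiable stop words.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(tokens)


def char_ngram_entropy(text, n=2):
    """Shannon entropy (in bits) of the character n-gram distribution.

    Lower entropy means the n-grams are more predictable, which suggests
    stronger contextual constraints.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


sample = "the cat and the dog and the bird".split()
coverage = top_k_coverage(sample, 2)  # "the" and "and" cover 5 of 8 tokens
```

A high top-k coverage points to a few frequent words that could seed
bootstrapping; heavy agglutination, which flattens word frequencies, would
be expected to lower it.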
We believe that it will be interesting to analyze an arbitrary language
along these lines for its suitability for adaptive recognition. Such a study
will help us minimize the intense engineering effort currently required
to build an OCR.
References
[1] A. L. Spitz, An OCR Based on Character Shape Codes and Lexical Information,
Proceedings of the 3rd ICDAR, Montreal, August 14-18, 1995, 723-728.
[2] G. Nagy, Y. Xu, Automatic Prototype Extraction for Adaptive OCR,
Proceedings of the 4th ICDAR, Ulm, August 18-20, 1997, 278-282.
[3] T. K. Ho, Bootstrapping Text Recognition from Stop Words,
Proceedings of the 14th ICPR, Brisbane, August 17-20, 1998, 605-609.
[4] D. J. Ittner, H. S. Baird, Language-Free Layout Analysis,
Proceedings of the 2nd ICDAR, Tsukuba Science City, Oct 1993, 336-340.