On Multi-script OCR System Evaluation
-------------------------------------

B. B. CHAUDHURI

Computer Vision and Pattern Recognition Unit
Indian Statistical Institute
203 Barrackpore Trunk Road
Calcutta - 700 035, INDIA
email: bbc@isical.ac.in

There has been a growing interest in multilingual and
multiscript OCR technology in recent years. Usually, such
systems are designed to read more than one script in the same
document. Other systems work with one script per document but
can be retargeted to a different script for a different
document. Multilingual multiscript OCR technology is important
for business transactions across Europe and the Orient. It has
great significance in a country like India, which has many
official state languages and scripts. Official documents are
often printed in three languages, namely English, Hindi and the
local state language.

However, OCR technology for Indian scripts is not yet mature,
and only one bilingual OCR system has been reported in the
recent past [1]. In that work the authors considered the Hindi
(Devnagari) and Bengali (Bangla) languages (scripts). Some work
on automatic identification of Indian scripts has also been
reported [2,3].

On the international scene, many monoscript systems are
commercially available, but multiscript OCR systems are yet to
be launched. It is still useful and important to develop
benchmarking and evaluation methodologies for multilingual OCR
in a general way, so that they may be employed to evaluate
systems developed later. While the mono-script OCR evaluation
approach has been more or less standardized over the past
years, the multiscript OCR evaluation approach is in its
infancy. The purpose of this article is to report our views on
some aspects of this problem.

OCR evaluation approaches are broadly classified into two
categories: blackbox evaluation and whitebox evaluation. In the
blackbox case the OCR system is treated as a black box where
only the input and output are visible to the evaluator. In a
whitebox situation, the outputs of different modules (e.g. skew
correction, zone segmentation, character extraction) of the
system may be accessed and evaluated. However, not all systems
are modular in form and, even if they are, the manufacturer may
not provide access to individual modules.

Nevertheless, the primary issues for both types of evaluation
are the speed and accuracy of the system. Accuracy is usually
computed as the percentage of characters or words correctly
recognized. Sometimes, the error is further subclassified into
misrecognition and rejection errors. Alternatively, it can be
subclassified into insertion, deletion and substitution errors.
The evaluation can be made against font size and style, print
and paper quality variations, as well as noise and degradation.
The scanning resolution can also be a variable parameter. Some
typical evaluation details with these parameters are reported
in [4,5].
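
The insertion, deletion and substitution breakdown mentioned
above can be illustrated with a short sketch. It assumes that
the ground truth and the OCR output are available as plain
character strings and that errors are counted along a
minimum-edit-distance alignment. The function names
error_counts and character_accuracy are hypothetical, and the
accuracy formula (truth length minus errors, over truth length)
is one common convention rather than a fixed standard.

```python
def error_counts(truth, output):
    """Count substitution, deletion and insertion errors along a
    minimum-edit-distance alignment of OCR output to ground truth."""
    n, m = len(truth), len(output)
    # table[i][j] = (total, subs, dels, ins) for truth[:i] vs output[:j]
    table = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)      # all truth characters dropped
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)      # all output characters spurious
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = table[i - 1][j - 1]
            if truth[i - 1] == output[j - 1]:
                diag = (t, s, d, k)             # match, no new error
            else:
                diag = (t + 1, s + 1, d, k)     # substitution
            t, s, d, k = table[i - 1][j]
            dele = (t + 1, s, d + 1, k)         # truth char missing
            t, s, d, k = table[i][j - 1]
            inse = (t + 1, s, d, k + 1)         # extra output char
            # Keep the cheapest alignment; ties prefer the diagonal.
            table[i][j] = min(diag, dele, inse, key=lambda c: c[0])
    _, subs, dels, ins = table[n][m]
    return {"substitutions": subs, "deletions": dels, "insertions": ins}


def character_accuracy(truth, output):
    """Percent accuracy relative to the (non-empty) ground truth."""
    errors = sum(error_counts(truth, output).values())
    return 100.0 * (len(truth) - errors) / len(truth)
```

For instance, if "recognition" is read as "recogmtion", the
alignment contains one substitution and one deletion. When
several alignments have the same minimum cost, the split among
the three error types may depend on the tie-breaking rule.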

In the case of multi-script OCR systems, several new problems
will crop up. First, the system may need to distinguish between
regions of different scripts before activating the OCR engine
of the particular script. Such a system has a script
identification module as a preprocessing unit. In a whitebox
environment, the performance of this identification module
should also be tested. For example, we have reported an
identification approach [4] that may work for a region as small
as that occupied by a four-character word. On the other hand,
some methods need a larger region, occupied by several text
lines.

In a blackbox system, the relative proportion and block size
of different scripts in a document can be used as parameters
against which the system performance can be evaluated. Also,
the error may be subcategorized to include misrecognition
across scripts, where a character of script X is misrecognized
as a character of script Y. This result will be important if a
multi-lingual OCR system does not distinguish separate text
regions and employs a single integrated recognition system for
the alphabets of all the scripts.
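
A minimal sketch of how cross-script misrecognition might be
tallied is given below. It assumes that ground truth and OCR
output have already been aligned into (truth, output) character
pairs, and it guesses the script of a character from the first
word of its Unicode name (e.g. LATIN, DEVANAGARI, BENGALI).
That heuristic and the function names script_of,
script_confusions and cross_script_rate are assumptions made
for illustration only.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Heuristic script label: first word of the Unicode character
    name, e.g. 'LATIN', 'DEVANAGARI', 'BENGALI'."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:          # character has no Unicode name
        return "UNKNOWN"


def script_confusions(aligned_pairs):
    """Tally substitution errors by (truth script, output script)."""
    tally = Counter()
    for truth_ch, out_ch in aligned_pairs:
        if truth_ch != out_ch:
            tally[(script_of(truth_ch), script_of(out_ch))] += 1
    return tally


def cross_script_rate(aligned_pairs):
    """Fraction of substitution errors that cross a script boundary."""
    tally = script_confusions(aligned_pairs)
    total = sum(tally.values())
    cross = sum(n for (a, b), n in tally.items() if a != b)
    return cross / total if total else 0.0


# Hypothetical aligned pairs: one correct character, one Latin
# confusion, one Devanagari-to-Bengali confusion and one confusion
# within Devanagari.
pairs = [("a", "a"), ("o", "c"),
         ("\u0915", "\u0995"), ("\u0915", "\u0916")]
rate = cross_script_rate(pairs)
```

The heuristic does not cover digits and punctuation, whose
Unicode names carry no script word, so a real evaluation would
need a proper script property table.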

Another issue in a multi-script system is that the scripts
differ in complexity. In a bilingual situation, the alphabet of
one script, e.g. English (Roman), may be simple and small in
number while the other script, e.g. Hindi (Devnagari), may be
more complex with a larger alphabet. Moreover, the writing or
printing style may connect the characters in a word, as in
Arabic, while this is not so for most European language
scripts. Thus, relative comparison of performance across
scripts is a difficult task, and it is necessary to devise a
script complexity measure for this purpose.

In order to evaluate various systems, a standard set of data
should be used so that the evaluation is unbiased. Kanungo and
Resnik [6] advocated the use of the Bible for the purpose. They
argue that Bible translations are closely parallel in content,
relevant with respect to modern-day language and quite
inexpensive. Other books whose translations have similar
properties, like the Koran and the Bhagavad Gita, are also
suggested as data sets.

We believe that the statistical aspect of evaluation is
neglected in this approach. A data set should reflect the
global character (unigram and bigram) occurrence probabilities
of text in the language. In such a case, the estimated error
also reflects the expected error of the OCR system. Suppose the
character occurrence frequency of a book in one language
reflects the probability of occurrence in that language. It is
still not guaranteed that its translation into another language
with a different alphabet will reflect the character occurrence
probability in that language.
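
One way to check how well a candidate data set reflects the
language is to compare its character n-gram distribution with
that of a large reference corpus. The sketch below is only an
illustration: the function names are hypothetical, the total
variation distance is one of several possible measures, and the
text is split into Unicode code points, which for Indian scripts
is a simplification of the printed graphemic shapes.

```python
from collections import Counter


def ngram_distribution(text, n=1):
    """Relative frequency of each character n-gram in the text
    (n=1 for unigrams, n=2 for bigrams)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions:
    0 when they are identical, 1 when they share no n-gram."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def mismatch(test_text, reference_text, n=1):
    """Distance between the n-gram statistics of a test data set
    and those of a reference corpus of the same language."""
    return total_variation(ngram_distribution(test_text, n),
                           ngram_distribution(reference_text, n))
```

A small distance for both n=1 and n=2 would suggest that the
error measured on the data set is close to the error expected
on general text. A large distance would indicate that the data
set over-represents or under-represents some characters.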

Another problem of concern arises for scripts with large
alphabets, as in Bangla or Hindi, where combinations of several
consonants are written in a compound character form. Some of
these compound characters are rarely used in words. An example
is the compound formed by the three-consonant combination 'ksm',
which is used in the word 'laxmi', the name of a Hindu goddess.
Such words cannot occur in the Bible or its translation. If
such a translation is used to evaluate a Hindi or Bangla OCR
system, then the compound character 'ksm' cannot be tested at
all. The comment perhaps holds for the Chinese and Japanese
scripts as well.
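
The coverage concern can be made concrete with a small check
that lists which required shapes never occur in a candidate
corpus. The sketch assumes that the text is encoded in Unicode,
where a Devnagari compound character is stored as its
consonants joined by the virama sign (U+094D), and that a list
of required shapes is supplied by the evaluator. The function
name coverage_report and the sample list are hypothetical.

```python
def coverage_report(corpus_text, required_shapes):
    """Return the fraction of required shapes found in the corpus
    and the list of shapes that never occur. A plain substring
    search is used, which is a simplification."""
    missing = [s for s in required_shapes if s not in corpus_text]
    covered = len(required_shapes) - len(missing)
    return covered / len(required_shapes), missing


# 'ksm' compound: KA + virama + SSA + virama + MA
KSM = "\u0915\u094d\u0937\u094d\u092e"
# 'tr' compound: TA + virama + RA
TR = "\u0924\u094d\u0930"
# The word 'laxmi': LA + 'ksm' compound + vowel sign II
LAXMI = "\u0932" + KSM + "\u0940"

ratio, missing = coverage_report(LAXMI, [KSM, TR])
```

Here the one-word corpus covers 'ksm' but not 'tr', so the
second shape is reported as missing. Run over a full corpus
with the complete list of compound characters of a script, the
same check shows which shapes the corpus cannot test.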

There are some popular and active languages in which a lot of
documents are produced, but which undergo spelling
modifications and graphemic changes over the years. An example
is Bangla, which is the second most popular language in the
Indian subcontinent. If an old Bangla document (perhaps newly
published) is chosen for the corpus and ground truth, then it
may not do justice to a modern OCR system. This statement holds
good for several Indian scripts. So, the data set should be
based on modern language, with full coverage of the script
alphabets of the different languages, and it should be large
enough to reflect the statistical occurrence probability of the
characters. Size and style variations of the data set should
also be generated so that the amount of data in a particular
size and style truly reflects how frequently it is used.

A systematic study of different scripts is necessary before
designing the experiment for OCR evaluation. The popular
scripts of the world can be grouped into three categories in
terms of alphabet size. English and most European scripts have
an alphabet size smaller than a hundred (i.e. of the order of
tens), and their characters remain disjoint in running text.
South Asian scripts (except Arabic) have a few hundred
graphemic shapes (i.e. of the order of hundreds), and these are
often connected in a word. The oriental scripts like Chinese
and Japanese have a few thousand graphemic shapes (i.e. of the
order of thousands), each often representing a word of the
text. Thus, character level accuracy is not meaningful for
these oriental texts. On the other hand, character level
accuracy is a good measuring stick for European (and North
American) as well as South Asian scripts. Finding a measure
that can satisfy the requirements of both European and oriental
scripts in a single document is not an easy task, but future
efforts should be directed toward this goal.

The performance of commercial OCR systems degrades very fast
with noise and distortion in the document. Several authors have
proposed good document degradation models for mono-script
systems [7,8]. Careful studies are needed before these models
are used on multiscript systems, because different scripts have
different degrees of structural complexity for the same font
size and identical style. The effect of noise-induced
degradation on them will therefore be different.

References:
[1]. B. B. Chaudhuri and U. Pal, "An OCR system to read two
Indian Language Scripts: Bangla and Devnagari (Hindi)", Proc.
Int. Conf. Document Analysis and Recognition (ICDAR),
pp. 1011-1015, 1997.

[2]. U. Pal and B. B. Chaudhuri, "Automatic separation of words
in Multi-lingual Multi-script Indian Document", Proc. Int. Conf.
Document Analysis and Recognition (ICDAR), pp. 576-583, 1997.

[3]. U. Pal and B. B. Chaudhuri, "Script line separation from
Indian multi-script document", J. Inst. Electronics and Telecom
Engg. (Communicated).

[4]. S. V. Rice, F. R. Jenkins and T. A. Nartker, "The Fifth
annual test of OCR accuracy", Tech. Rep. TR-96-01, Information
Science Research Institute, University of Nevada, Las Vegas,
NV, 1996.

[5]. S. Chen, S. Subramaniam, R. M. Haralick and I. T. Phillips,
"Performance evaluation of two OCR systems" in Proc. Annual Symp.
Document Analysis and Information Retrieval, pp. 299-317, 1994.

[6]. T. Kanungo and P. Resnik, "The Bible, truth, and
multilingual OCR evaluation" in Proc. of SPIE Conf. on Document
Recognition and Retrieval VI, D. Lopresti and Y. Zhou, eds, (San
Jose, CA), 1999.

[7]. T. Kanungo, H. S. Baird and R. M. Haralick, "Validation and
estimation of document degradation models" in Proc. of Fourth
Annual Symp. on Document Analysis and Information Retrieval (Las
Vegas, NV), April 24-26, 1995.

[8]. H. S. Baird, "Document image defect models" in Structural
Document Image Analysis, Springer-Verlag, New York, 1992.