On Multi-script OCR System Evaluation
               -------------------------------------
 
                          B. B. CHAUDHURI 
              Computer Vision and Pattern Recognition Unit 
                    Indian Statistical Institute 
                     203 Barrackpore Trunk Road 
                      Calcutta -700 035 INDIA
                       email: bbc@isical.ac.in 
 
	There has been growing interest in multilingual and
multiscript OCR technology in recent years.  Usually, such
systems are designed to read more than one script in the same
document.  Other systems work with one script in a document, but
they can be re-targeted to a different script in a different
document.  Multilingual, multiscript OCR technology is important
for business transactions across Europe and the Orient.  It has
great significance in a country like India, which has many
official state languages and scripts.  Official documents are
often printed in three languages, namely English, Hindi and the
local state language.
 
	However, OCR technology for Indian scripts is not yet
mature, and only a bilingual OCR system has been reported in the
recent past [1].  In that work the authors considered the Hindi
(Devnagari) and Bengali (Bangla) languages (scripts).  Some work
on automatic identification of Indian scripts has also been
reported [2,3].
 
	On the international scene, many monoscript systems are
commercially available, but multiscript OCR systems are yet to be
launched.  It is therefore useful and important to develop
benchmarking and evaluation methodologies for multilingual OCR in
a general way, so that they may be employed to evaluate systems
developed at a later period.  While the mono-script OCR
evaluation approach has been more or less standardized over the
past years, the multiscript OCR evaluation approach is still in
its infancy.  The purpose of this article is to report our views
on some aspects of this problem.
 
	OCR evaluation approaches are broadly classified into
two categories: blackbox evaluation and whitebox evaluation.  In
the blackbox case the OCR system is treated as a blackbox where
only the input and output are visible to the evaluator.  In a
whitebox situation, the outputs of different modules (e.g. skew
correction, zone segmentation, character extraction) of the
system may be accessed and evaluated.  However, not all systems
are modular, and even if they are, the manufacturer may not
provide access to individual modules.
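
	To make the distinction concrete, a blackbox evaluator can
only call the end-to-end recognizer, while a whitebox evaluator
may also query individual modules when the vendor exposes them.
The following minimal Python sketch illustrates the two
interfaces; the class and method names are our own assumptions,
not part of any real OCR product.

from typing import List, Protocol, Tuple


class BlackBoxOCR(Protocol):
    """What a blackbox evaluator sees: a page image in, text out."""

    def recognize(self, page_image: bytes) -> str:
        ...


class WhiteBoxOCR(BlackBoxOCR, Protocol):
    """A hypothetical modular system whose intermediate outputs are exposed."""

    def estimate_skew(self, page_image: bytes) -> float:
        """Skew angle (in degrees) from the skew-correction module."""
        ...

    def segment_zones(self, page_image: bytes) -> List[Tuple[int, int, int, int]]:
        """Bounding boxes of text zones from the zone-segmentation module."""
        ...

    def extract_characters(self, page_image: bytes) -> List[bytes]:
        """Isolated character images from the character-extraction module."""
        ...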
 
	Nevertheless, the primary issues in both types of
evaluation are the speed and accuracy of the system.  Accuracy
is usually computed as a percentage of characters or words.
Sometimes the errors are further subclassified into
misrecognition and rejection errors.  Alternatively, they can be
subclassified into insertion, deletion and substitution errors.
The evaluation can be made against font size and style, print and
paper quality variations, as well as noise and degradation.  The
scanning resolution can also be a variable parameter.  Typical
evaluation details involving these parameters are reported in
[4,5].
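
	As an illustration of the character-level measures
mentioned above, the following sketch aligns the OCR output with
the ground truth by edit distance and reports an ISRI-style
character accuracy together with insertion, deletion and
substitution counts.  This is one common way of computing such
figures, not necessarily the exact procedure of [4,5].

def character_errors(ground_truth, ocr_output):
    """Classify OCR errors by edit-distance alignment.

    Returns (accuracy, insertions, deletions, substitutions), where accuracy
    follows the ISRI convention: (ground-truth characters - all errors) /
    ground-truth characters; it can be negative for very noisy output."""
    m, n = len(ground_truth), len(ocr_output)
    # dp[i][j] = minimum edit cost to turn ground_truth[:i] into ocr_output[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ground_truth[i - 1] == ocr_output[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # a true character was deleted
                           dp[i][j - 1] + 1,         # a spurious character was inserted
                           dp[i - 1][j - 1] + cost)  # match or substitution
    ins = dele = sub = 0
    i, j = m, n
    while i > 0 or j > 0:                            # trace back one optimal alignment
        if i > 0 and j > 0 and ground_truth[i - 1] == ocr_output[j - 1] \
                and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            sub, i, j = sub + 1, i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dele, i = dele + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    errors = ins + dele + sub
    accuracy = 100.0 * (m - errors) / m if m else 100.0
    return accuracy, ins, dele, sub

For example, character_errors('laxmi', 'laxrni') reports one
insertion and one substitution error.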
 
	In the case of multi-script OCR systems, several new
problems will crop up.  First, the system may need to distinguish
between regions of different scripts before activating the OCR
engine for the particular script.  Such a system has a script
identification module as a preprocessor unit.  In a whitebox
environment, the performance of this identification module should
also be tested.  For example, we have reported an identification
approach [2] that may work for a region as small as that occupied
by a four-character word.  On the other hand, some methods need a
larger region, occupied by several text lines.
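
	In a whitebox setting, such a script identification module
can be scored on its own, before the downstream OCR engines are
considered.  The sketch below assumes a hypothetical
identify_script() callable and simply tabulates its accuracy and
confusions over labelled region images; the same harness can be
run with word-sized, line-sized and block-sized regions to see
how much context a given method needs.

from collections import Counter

def evaluate_script_identifier(identify_script, labelled_regions):
    """Score a script-identification module on (region_image, true_script) pairs.

    `identify_script` is a hypothetical callable that returns a script name
    such as 'Roman', 'Devnagari' or 'Bangla' for a given region image."""
    confusion = Counter()
    correct = 0
    for region_image, true_script in labelled_regions:
        predicted = identify_script(region_image)
        confusion[(true_script, predicted)] += 1
        if predicted == true_script:
            correct += 1
    accuracy = 100.0 * correct / len(labelled_regions) if labelled_regions else 0.0
    return accuracy, confusion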
 
	In a blackbox system, the relative proportion and block
size of the different scripts in a document can be used as
parameters against which the system performance is evaluated.
Also, the performance error may be subcategorized into
misrecognition across scripts, where a character of script X is
misrecognized as a character of script Y.  This result is
important if a multi-lingual OCR system does not distinguish
separate text regions and employs a single integrated recognition
system for the alphabets of all the scripts.
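
	One way to realize this subcategorization is to label
every character with its script and then split the substitution
errors into within-script and cross-script confusions.  A rough
sketch, assuming a character-level alignment of ground truth and
output is already available as (true_char, recognized_char) pairs
and that script_of() maps a character to its script (say, via a
Unicode block lookup):

from collections import Counter

def split_substitutions(aligned_pairs, script_of):
    """Separate within-script from cross-script substitution errors."""
    within, across = Counter(), Counter()
    for true_char, rec_char in aligned_pairs:
        if true_char == rec_char:
            continue                     # correctly recognized, not an error
        src, dst = script_of(true_char), script_of(rec_char)
        if src == dst:
            within[(true_char, rec_char)] += 1
        else:
            across[(src, dst)] += 1      # e.g. a Devnagari character read as Roman
    return within, across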
 
  
	In a multi-script system, another issue to consider is
that the scripts differ in complexity.  In a bilingual
situation, the alphabet of one script, e.g. English (Roman), may
be simple and small, while the other script, e.g. Hindi
(Devnagari), may be more complex with a larger alphabet.
Moreover, the writing or printing style may make the characters
of a word connected, as in Arabic, while this is not so for most
European language scripts.  Thus, relative comparison of
performance across scripts is a difficult task, and it is
necessary to devise a script complexity measure for this purpose.
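
	No such measure is standardized yet.  Purely as a
hypothetical illustration, one could combine the alphabet size,
the fraction of compound (conjunct) shapes and the tendency of
characters to connect within a word into a single score; the
weights below are arbitrary and would have to be calibrated
empirically.

import math

def script_complexity(alphabet_size, conjunct_fraction, connectedness,
                      weights=(0.5, 0.3, 0.2)):
    """A hypothetical script-complexity score in [0, 1].

    alphabet_size     -- number of distinct graphemic shapes
    conjunct_fraction -- fraction of shapes that are compound/conjunct forms
    connectedness     -- fraction of adjacent characters joined in print
                         (near 0 for most European scripts, near 1 for Arabic)"""
    # Log-scale alphabet size, saturating at 10,000 shapes, so that scripts
    # with tens, hundreds and thousands of shapes are spread over [0, 1].
    size_term = min(math.log10(max(alphabet_size, 1)) / 4.0, 1.0)
    w_size, w_conj, w_conn = weights
    return w_size * size_term + w_conj * conjunct_fraction + w_conn * connectedness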
	 
	In order to evaluate various systems, a standard data set
should be used so that the evaluation is unbiased.  Kanungo
and Resnik [6] advocated the use of the Bible for this purpose.
They argue that Bible translations are closely parallel in
content, relevant with respect to modern-day language, and quite
inexpensive.  Other books whose translations have similar
properties, such as the Koran and the Bhagavad Gita, were also
suggested as data sets.
 
	We believe that the statistical aspect of evaluation is
neglected in this approach.  A data set should be such that it
reflects the global character (unigram and bigram) probabilities
of occurrence of text in the language.  In that case, the
estimated error also reflects the expected error of the OCR
system.  Even if the character occurrence frequencies of a book
in one language reflect the probabilities of occurrence in that
language, it is not guaranteed that its translation into another
language with a different alphabet will also reflect the
character occurrence probabilities of that language.
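
	One simple check of this property is to compare the
character unigram (and, similarly, bigram) distribution of a
candidate evaluation corpus against that of a much larger
reference corpus of the same language, for instance with the
total variation distance; a large distance signals that the test
data are not statistically representative.  A rough sketch:

from collections import Counter

def unigram_distribution(text):
    """Relative character frequencies of a text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

def distribution_distance(candidate_text, reference_text):
    """Total variation distance between two unigram distributions.

    0 means identical character frequencies; 1 means disjoint alphabets."""
    p = unigram_distribution(candidate_text)
    q = unigram_distribution(reference_text)
    chars = set(p) | set(q)
    return 0.5 * sum(abs(p.get(ch, 0.0) - q.get(ch, 0.0)) for ch in chars)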
 
	Another problem of concern arises for scripts with a large
alphabet, as in Bangla or Hindi, where combinations of several
consonants are written in a compound character form.  Some of
these compound characters are rarely used in words.  An example
is the compound formed by the three-consonant combination 'ksm',
which is used in the word 'laxmi', the name of a Hindu goddess.
Such words cannot occur in the Bible or its translations.  If
such a translation is used to evaluate a Hindi or Bangla OCR
system, then the compound character ksm cannot be tested at all.
The comment perhaps holds for the Chinese and Japanese scripts as
well.
 
	There are some popular and active languages in which a lot
of documents are produced, but which undergo spelling
modifications and graphemic changes over the years.  An example
is Bangla, the second most popular language in the Indian
subcontinent.  If an old Bangla document (perhaps newly
republished) is chosen for the corpus and ground truth, it may
not do justice to a modern OCR system.  This statement holds for
several Indian scripts.  So, the data set should be based on the
modern language, with full coverage of the script alphabets of
the different languages, and it should be large enough to reflect
the statistical occurrence probabilities of the characters.  The
size and style variation of the data set should also be generated
so that the amount of data in a particular size and style truly
reflects how frequently they are used.
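
	A simple automated check along these lines is to verify
that every graphemic unit of the target alphabet, including
compound characters such as ksm, occurs in the candidate corpus
at least a few times, and to flag the units that are missing or
too rare to be tested.  A minimal sketch, where the alphabet list
and the threshold are supplied by the evaluator:

def coverage_report(corpus_text, alphabet, min_occurrences=5):
    """Report how well a candidate corpus covers a script's alphabet.

    `alphabet` is the full list of graphemic units to be tested (compound
    characters may span several code points and are counted as substrings);
    `min_occurrences` is an arbitrary threshold."""
    counts = {unit: corpus_text.count(unit) for unit in alphabet}
    missing = [u for u, c in counts.items() if c == 0]
    rare = [u for u, c in counts.items() if 0 < c < min_occurrences]
    coverage = 1.0 - len(missing) / len(alphabet) if alphabet else 1.0
    return {"coverage": coverage, "missing": missing, "rare": rare}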
 
	A systematic study of different scripts is necessary before
designing the experiment for OCR evaluation.  The popular scripts
of the world can be grouped into three categories in terms of
alphabet size.  English and most European scripts have an
alphabet size smaller than a hundred (i.e. of the order of tens),
and the characters remain disjoint in running text.  South Asian
scripts (excluding Arabic) have a few hundred graphemic shapes
(i.e. of the order of hundreds), and the characters are often
connected within a word.  Oriental scripts like Chinese and
Japanese have a few thousand graphemic shapes (i.e. of the order
of thousands), each often representing a word of the text.  Thus,
character-level accuracy is not meaningful for these Oriental
texts.  On the other hand, character-level accuracy is a good
yardstick for European (and North American) as well as South
Asian scripts.  To find a measure that satisfies the requirements
of both European and Oriental scripts in a single document is not
an easy task, but future efforts should be directed toward this
goal.
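
	For scripts in which a single graphemic shape stands for a
whole word, word-level (or symbol-level) accuracy is the natural
counterpart of character accuracy, and the two can be reported
side by side.  A sketch of word accuracy based on longest
matching subsequences; for scripts written without spaces, the
tokenizer would have to be replaced by a symbol-level one:

from difflib import SequenceMatcher

def word_accuracy(ground_truth, ocr_output, tokenizer=str.split):
    """Fraction of ground-truth tokens reproduced in order by the OCR output."""
    truth_tokens = tokenizer(ground_truth)
    output_tokens = tokenizer(ocr_output)
    matcher = SequenceMatcher(a=truth_tokens, b=output_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(truth_tokens) if truth_tokens else 100.0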
 
	Commercial OCR system performance degrades very fast
with noise and distortion in the document.  Several authors have
proposed good document degradation models for mono-script
systems [7,8].  Careful studies are needed to apply these models
to multiscript systems.  This is because different scripts have
different degrees of structural complexity for the same font size
and identical style, so the degradation effect due to noise will
differ among them.
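
	As a crude illustration (a strong simplification of the
models in [7,8], not a reimplementation of them), the sketch
below flips pixels of a binary character image with a probability
that decays with the distance from the nearest stroke boundary;
running the recognizer on images degraded at increasing noise
levels shows how quickly the accuracy of each script falls off.

import numpy as np

def degrade(binary_image, p0=0.3, alpha=1.0, seed=0):
    """Flip pixels with probability p0 * exp(-alpha * d^2), d being a crude
    estimate of the distance to the nearest stroke boundary.

    `binary_image` is a 2-D array with foreground pixels equal to 1.  This
    is only a simplified stand-in for the degradation models of [7,8]."""
    rng = np.random.default_rng(seed)
    img = binary_image.astype(np.uint8)
    # A pixel lies on a boundary if any 4-neighbour has a different value.
    padded = np.pad(img, 1, mode="edge")
    neighbours = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                           padded[1:-1, :-2], padded[1:-1, 2:]])
    on_boundary = (neighbours != img).any(axis=0)
    # Crude city-block distance to the boundary: 0 on it, +1 per growth step.
    dist = np.full(img.shape, np.inf)
    dist[on_boundary] = 0.0
    for _ in range(5):                       # distances beyond 5 hardly matter here
        grown = np.pad(dist, 1, mode="edge")
        nearest = np.minimum.reduce([grown[:-2, 1:-1], grown[2:, 1:-1],
                                     grown[1:-1, :-2], grown[1:-1, 2:]]) + 1.0
        dist = np.minimum(dist, nearest)
    flip_prob = p0 * np.exp(-alpha * np.square(np.minimum(dist, 5.0)))
    flips = rng.random(img.shape) < flip_prob
    return np.where(flips, 1 - img, img)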
 
References:
[1].	B. B. Chaudhuri and U. Pal, "An OCR system to read two
Indian language scripts: Bangla and Devnagari (Hindi)", Proc.
Int. Conf. Document Analysis and Recognition (ICDAR), pp. 1011-
1015, 1997.
 
[2].	U. Pal and B. B. Chaudhuri, "Automatic separation of words
in multi-lingual multi-script Indian documents", Proc. Int. Conf.
Document Analysis and Recognition (ICDAR), pp. 576-583, 1997.
 
[3].	U. Pal and B. B. Chaudhuri, "Script line separation from
Indian multi-script documents", J. Inst. Electronics and Telecom.
Engg. (communicated).

[4].	S. V. Rice, F. R. Jenkins and T. A. Nartker, "The fifth
annual test of OCR accuracy", Tech. Rep. TR-96-01, Information
Science Research Institute, University of Nevada, Las Vegas, NV,
1996.
 
[5].	S. Chen, S. Subramaniam, R. M. Haralick and I. T. Phillips,
"Performance evaluation of two OCR systems", Proc. Annual Symp.
Document Analysis and Information Retrieval, pp. 299-317, 1994.
 
[6].	T. Kanungo and P. Resnik, "The Bible, truth, and
multilingual OCR evaluation", Proc. SPIE Conf. on Document
Recognition and Retrieval VI, D. Lopresti and Y. Zhou, eds. (San
Jose, CA), 1999.
 
[7].	T. Kanungo, H. S. Baird and R. M. Haralick, "Validation and
estimation of document degradation models", Proc. Fourth Annual
Symp. on Document Analysis and Information Retrieval (Las Vegas,
NV), April 24-26, 1995.
 
[8].	H. S. Baird, "Document image defect models", in Structured
Document Image Analysis, Springer-Verlag, New York, 1992.