A Survey of Indian Script OCR Systems
-------------------------------------
Pankaj DOKE, Rinki GUPTA, Vidyesh NABAR
Applied Technology Group, Tata Infotech Ltd., INDIA
{pankaj.doke,rinki.gupta,vidyesh.nadar}@tatainfotech.com
Dr. R. K. SHEVGAONKAR
Electrical Engineering Department
Indian Institute of Technology, Mumbai, INDIA
rks@ee.iitb.ernet.in
In this paper we bring out the characteristics of Bangla and Devanagari
character recognition systems reported in the Indian subcontinent.
Research in Devanagari character recognition has centers of activity
at TIFR, ISI, IIT, CDAC and Tata Infotech Ltd. The approaches reported
to date have taken a structural approach, using feature primitives that
constitute the building blocks of the character. Almost all have used a
syntactic tree classifier, applied after feature extraction, for the
recognition phase.
The basic alphabet set of Bangla script is very large, comprising 11
vowels, 39 consonants and 4 shape modifiers. The number of symbols is
approximately 300.
The Bangla Character Recognition System developed by B. B. Chaudhuri and
Pal [Chaudhuri and Pal, 1995] handles machine printed, single font, single
size Bangla script and offers no support for handwritten characters. The
accuracy achieved by this system is approximately 96%. Preprocessing
involves tilt correction (which can handle tilts up to 10 degrees),
followed by noise removal and preliminary segmentation of the input image
into sentences, zones and letters. However, this system differs markedly
from other OCR systems by not using thinning/skeletonization. A
combination of feature and template matching is employed for recognition.
There are eight stroke-based features at angles (0, 90, 180, +/-45) and a
filled circle representing a dot. Simple characters are recognized using a
feature-based tree classifier, whereas compound characters are recognized
using template matching preceded by feature-based grouping. Some character
occurrence statistics are also used to aid the recognition process.
Devanagari script (script for Sanskrit, Hindi, Marathi and Pali languages)
is a moderately complex pattern. Unlike simple juxtaposition in Roman
script, a word in Devanagari script is composed of composite characters
joined by a horizontal line at the top. The basic alphabet set of
Devanagari is very large, comprising about 13 vowels, 34 consonants and
14 matras. The number goes up once half letter forms are also considered.
This makes the development of an OCR for Devanagari, the most popular
script in the Indian subcontinent, a highly challenging task. Some
reports are available on the various approaches for Devanagari Character
Recognition [Sinha and Mahabala, 1979], [Chaudhuri and Pal, 1995],
[Kshitij Sharma, 1997].
Work reported in [Kshitij Sharma, 1997] is based on a syntactic pattern
recognition approach. This system is aimed at the recognition of machine
printed characters and claims an accuracy of about 90%. Standard
pre-processing techniques have been used in this system. This is followed
by feature extraction, for which the image of each letter is
partitioned into nine equal regions by placing a 3x3 grid over it. This
grid is uniformly distributed over the whole image, and no scaling to a
standard size is required. Each section of the grid is then searched for
four primitives (features): 1. joint points, 2. crosses, 3. tails and 4.
isolated points. The primitives from each section are then coded to form a
108-element binary feature vector. This vector is then matched against
stored descriptions for recognition. A neural-network-like approach has
been used for feature matching and classification. The system recognizes
the basic characters and matras/modifiers separately, so the two results
have to be combined to get the desired output.
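The grid coding described above can be sketched in Python. The source
does not say how four primitives in nine regions yield 108 bits, so the
sketch assumes, purely for illustration, that the count of each primitive
in each region is capped at 7 and stored in 3 bits (9 x 4 x 3 = 108). The
primitive detector itself is not shown, and the names grid_cell and
feature_vector are hypothetical.

```python
PRIMITIVES = ("joint", "cross", "tail", "isolated")
GRID = 3            # 3x3 grid laid over the whole character image
BITS_PER_COUNT = 3  # ASSUMPTION: 9 regions * 4 primitives * 3 bits = 108


def grid_cell(row, col, height, width):
    """Index (0..8) of the grid region containing pixel (row, col).

    The grid is stretched over the actual image size, so the character
    does not have to be scaled to a standard size first.
    """
    return (row * GRID // height) * GRID + (col * GRID // width)


def feature_vector(points, height, width):
    """Build the binary vector from detected primitives.

    points: iterable of (row, col, primitive_name) tuples produced by
    some primitive detector (hypothetical, not part of this sketch).
    """
    counts = [[0] * len(PRIMITIVES) for _ in range(GRID * GRID)]
    for row, col, name in points:
        cell = grid_cell(row, col, height, width)
        counts[cell][PRIMITIVES.index(name)] += 1

    max_count = 2 ** BITS_PER_COUNT - 1
    bits = []
    for cell_counts in counts:
        for count in cell_counts:
            capped = min(count, max_count)
            # Most significant bit first.
            for shift in range(BITS_PER_COUNT - 1, -1, -1):
                bits.append((capped >> shift) & 1)
    return bits


if __name__ == "__main__":
    vec = feature_vector([(0, 0, "joint"), (29, 29, "tail")], 30, 30)
    print(len(vec), sum(vec))
```

Because the region index is computed from the image's own height and
width, two images of the same character at different sizes map their
primitives to the same regions, which is consistent with the statement
that no scaling is required.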
Chaudhuri and Pal of ISI Calcutta presented an OCR system for printed
Devanagari script which claims an accuracy of 95% at the character level.
In this system some standard and some new techniques have been used for
preprocessing. However, thinning has not been carried out on the images.
From zonal information and shape characteristics, the basic, modified and
compound characters are separated for convenience of classification.
Modified and basic characters are recognized by a structural-feature-based
binary tree classifier, while the compound characters are intended to be
recognized by a hybrid approach.
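A feature-based binary tree classifier of the kind mentioned above can be
sketched as follows. This is only an illustration of the general idea:
each internal node tests whether one structural feature is present, and
each leaf names a character class. The feature names and class labels
below are placeholders, not the ones used by the reported system.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    """Internal node: branch on the presence of one structural feature."""
    feature: str
    present: Union["Node", str]  # subtree or class label if feature found
    absent: Union["Node", str]   # subtree or class label otherwise


def classify(node, features):
    """Walk the tree until a leaf (a plain string label) is reached."""
    while isinstance(node, Node):
        node = node.present if node.feature in features else node.absent
    return node


# Placeholder tree with hypothetical features and classes.
EXAMPLE_TREE = Node(
    "vertical_bar",
    present=Node("closed_loop", present="class_A", absent="class_B"),
    absent="class_C",
)

if __name__ == "__main__":
    print(classify(EXAMPLE_TREE, {"vertical_bar", "closed_loop"}))
```

A tree of this shape needs at most one feature test per level, so a
character is classified after a number of tests equal to the depth of
its leaf.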
The system makes use of a principal set of simple stroke-based features
for classification. The features are searched for in the middle
zone of the basic character, which is not divided into any further
sections. At some nodes of the classifier, additional features
(such as number of crossings and projection profile) are used to further
differentiate between a subgroup/class of characters at that node. To
detect the modified characters, the upper and lower zones are inspected.
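The two additional features named above, projection profile and number
of crossings, are simple to compute on a binary image. The sketch below
assumes the image is a list of rows with 1 for ink and 0 for background.
Locating the head line as the row with the most ink is a common
heuristic for Devanagari and is an assumption here, not a description of
how the reported system separates the zones.

```python
def horizontal_projection(image):
    """Number of ink pixels in each row of a binary image."""
    return [sum(row) for row in image]


def crossings(line):
    """Number of background-to-ink transitions along one scan line."""
    count = 0
    previous = 0
    for pixel in line:
        if pixel and not previous:
            count += 1
        previous = pixel
    return count


def head_line_row(image):
    """Row with the most ink (ASSUMED heuristic for the head line)."""
    profile = horizontal_projection(image)
    return profile.index(max(profile))


if __name__ == "__main__":
    image = [
        [0, 0, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 1, 1, 1, 1, 0],
    ]
    print(horizontal_projection(image))
    print(head_line_row(image))
    print([crossings(row) for row in image])
```

Rows above the head line would then belong to the upper zone, and the
crossing counts of rows below it give a cheap way to tell apart shapes
with one vertical stroke from shapes with two.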
The system developed at TIFR by [Sinha and Mahabala, 1979] uses a novel
approach of a syntactic classifier with an embedded picture language
"PLANG" (picture language for a class of pictures) to solve the problem of
Devanagari ICR. The system has been designed to handle both handwritten
and machine printed characters. The system can be conceptually divided
into two primary modules: a) the Primitive Identification module and
b) the Picture Language Monitor.
The Primitive Identification module deploys standard image processing
algorithms for pre-processing. The system is extensible to recognize
variants of the basic character shape.
Each character is associated with a PLANG sentence corresponding to a
picture frame. A PLANG sentence has four kinds of components:
a) Primitives, which are the terminal symbols of the language;
b) Partitioning functions, which partition the picture frame to extract
the regions (essentially an N x N grid);
c) Macros, which are the non-terminal symbols of the language and can be
expressed in terms of terminals, other non-terminals or themselves;
d) Frame operations, which comprise the set of operations (append and
superimpose) on the picture frames to generate the characters.
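The two frame operations can be illustrated with a small Python sketch.
The source names the operations but does not define them, so the
semantics below are assumptions: a frame is taken to be a binary grid,
superimpose is taken to overlay two frames of equal size, and append is
taken to place one frame to the right of another. The function names
are hypothetical and are not PLANG syntax.

```python
def superimpose(a, b):
    """Overlay two equal-sized frames (ASSUMED meaning of 'superimpose').

    A cell is set in the result if it is set in either input frame.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("frames must have the same shape")
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def append(a, b):
    """Place frame b to the right of frame a (ASSUMED meaning of 'append')."""
    if len(a) != len(b):
        raise ValueError("frames must have the same height")
    return [ra + rb for ra, rb in zip(a, b)]


if __name__ == "__main__":
    vertical = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]    # a vertical stroke
    head_line = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]   # a top horizontal stroke
    for row in superimpose(vertical, head_line):
        print(row)
    for row in append(vertical, head_line):
        print(row)
```

Under these assumptions, a character built from several primitives is
the result of a short sequence of such operations on primitive frames.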
To increase the accuracy of the system and reduce the computational costs,
contextual information about the occurrences of certain primitives, their
combinations and the restrictions on them is also used.
Testing of the Indian script ICR/OCR systems has not been exhaustive, as
there is a lack of standard test databases (ground truthed data) for
the Indian languages. There is also a lack of statistical analysis and
of government approved/recommended fonts and/or databases. Any effort
invested in these activities could go a long way towards furthering
research and commercial systems for the recognition of machine and hand
printed Indian languages.
The use of techniques from the AI community to build robust classifiers
and learning systems could greatly aid the quality of recognition systems.