Title: Offline Printed Kannada Script Recognition System

Authors: Atul Negi, B. Phani Kumar, B.K. Trinadh

Affiliation: Artificial Intelligence Lab, University of Hyderabad

Address for communication (Atul Negi):
Artificial Intelligence Lab,
University of Hyderabad,
Gachibowli, Hyderabad 500 046 INDIA
Telephone: 91-040-3010500 extn. 4056, 4105
Fax: 91-040-3010 145, 3010 120
e-mail: atulcs@uohyd.ernet.in

Keywords: Asian Languages, South Indian Scripts, Kannada, OCR

Kannada Language:
Kannada is one of the so-called Dravidian languages of South India. It has
its own script, which is orthographically very similar to that of Telugu
(another Dravidian language). While OCR systems for Telugu have been studied
previously, to our knowledge Kannada OCR has so far been a neglected field.
Kannada, like many Indian languages, is built up from a base character set
of 52 characters: 16 vowels and 36 consonants. In addition, there are about
as many stress marks as there are base characters. Stress marks (vothus)
modify the base characters and are appendages attached to them, mostly at
the top or bottom, although attachment across the middle is also possible.
Since the appendages can also touch the base characters, the set of distinct
symbols to be recognised is potentially very large. The Bureau of Indian
Standards has defined the ISCII code, which can be used for the electronic
representation of the language.

Our Approach:
In this presentation we take up Kannada script OCR using a corpus extracted
from the printed budget speech of the Karnataka State Government for
1996-1997.

Feature Extraction and Classification: Two approaches were attempted. The
first is based on geometric invariant features, namely the Zernike and
pseudo-Zernike moments, followed by classification using a Probabilistic
Neural Network classifier (Masters 1995). The classification accuracy was
80% for the Zernike moments and 92% for the pseudo-Zernike moments, measured
on a training set extracted from the corpus.
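
The first approach can be illustrated with a minimal Python sketch. It is not
the implementation used in this work: the function names (radial_poly,
zernike_magnitudes, pnn_classify), the maximum moment order and the kernel
width sigma are all assumptions made for illustration, and only ordinary
Zernike moments are shown (pseudo-Zernike moments use a different radial
polynomial). The classifier is the textbook form of a probabilistic neural
network, that is, a Gaussian Parzen-window estimate per class.

```python
import numpy as np
from math import factorial


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho); requires n - |m| even and |m| <= n."""
    m = abs(m)
    out = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        coeff = ((-1) ** s * factorial(n - s)
                 / (factorial(s)
                    * factorial((n + m) // 2 - s)
                    * factorial((n - m) // 2 - s)))
        out += coeff * rho ** (n - 2 * s)
    return out


def zernike_magnitudes(img, max_order=4):
    """Magnitudes |A_nm| for 0 <= m <= n <= max_order with n - m even.

    The image is mapped onto the square [-1, 1] x [-1, 1] and only pixels
    inside the unit disk are used. The constant pixel-area factor is omitted,
    so values are correct up to a common scale.
    """
    img = np.asarray(img, dtype=float)
    rows, cols = img.shape
    x, y = np.meshgrid(np.linspace(-1, 1, cols), np.linspace(-1, 1, rows))
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0
    feats = []
    for n in range(max_order + 1):
        for m in range(n % 2, n + 1, 2):
            # Complex conjugate of the Zernike basis function V_nm.
            basis = radial_poly(n, m, rho) * np.exp(-1j * m * theta)
            a_nm = (n + 1) / np.pi * np.sum(img[inside] * basis[inside])
            feats.append(abs(a_nm))  # the magnitude is rotation invariant
    return np.array(feats)


def pnn_classify(train_X, train_y, x, sigma=1.0):
    """Return the class whose mean Gaussian kernel response to x is largest."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    sq_dist = np.sum((train_X - np.asarray(x, dtype=float)) ** 2, axis=1)
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2))
    classes = np.unique(train_y)
    scores = [kernel[train_y == c].mean() for c in classes]
    return classes[int(np.argmax(scores))]
```

In use, each segmented character image would be converted to a feature vector
with zernike_magnitudes and passed to pnn_classify together with the labelled
training vectors. The value of sigma depends strongly on the scale of the
features, so in practice it would have to be tuned on held-out data.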

The second approach extracted projection profiles of the base character set
in the four contour directions as one set of features. Another set of
features was obtained by sampling the contour direction histograms as in
(Fumitaka 1987), followed by modified quadratic discriminant classification
as described in the same reference. Classification accuracy in either case
was disappointingly low: about 50% for the same corpus.
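
The ingredients of the second approach can be sketched in the same hedged
way. The code below is an illustration under stated assumptions, not the
procedure used in this work: "four directions" is read here as horizontal,
vertical and the two diagonals, the function names projection_profiles and
mqdf2 are hypothetical, and the discriminant is the commonly cited MQDF2
form, in which the k largest eigenvalues of the class covariance are kept
and the remaining ones are replaced by a constant delta. Whether this matches
the exact variant applied to the corpus is an assumption.

```python
import numpy as np


def projection_profiles(img):
    """Sum pixel values along rows, columns and both diagonal directions."""
    img = np.asarray(img, dtype=float)
    rows, cols = img.shape
    horizontal = img.sum(axis=1)  # one value per row
    vertical = img.sum(axis=0)    # one value per column
    offsets = range(-(rows - 1), cols)
    # np.trace with an offset sums a single diagonal of the matrix.
    diagonal = np.array([np.trace(img, offset=k) for k in offsets])
    # Flipping the columns turns anti-diagonals into diagonals.
    anti_diagonal = np.array([np.trace(img[:, ::-1], offset=k) for k in offsets])
    return horizontal, vertical, diagonal, anti_diagonal


def mqdf2(x, mean, cov, k, delta):
    """MQDF2-style distance of x to a class (smaller means a better match).

    Assumption: the k largest eigenvalues of cov are kept and every other
    eigenvalue is replaced by the constant delta.
    """
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1][:k]  # indices of the k largest eigenvalues
    lam, phi = evals[order], evecs[:, order]
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    proj = phi.T @ diff                          # coordinates in the principal subspace
    residual = diff @ diff - np.sum(proj ** 2)   # energy outside that subspace
    d = diff.size
    return (np.sum(proj ** 2 / lam) + residual / delta
            + np.sum(np.log(lam)) + (d - k) * np.log(delta))
```

A feature vector could be formed by concatenating the four profiles (after
resampling them to a fixed length), and a character would be assigned to the
class with the smallest mqdf2 value. With k equal to the feature dimension
the expression reduces to the ordinary quadratic discriminant.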

Future Work:
This work was only exploratory. More rigorous testing on larger corpora,
with better training data, needs to be carried out.

We are grateful to Dr. K. Narayana Murthy of Dept. of CIS, University of
Hyderabad for valuable advice, discussions and suggestions.

References:
(Fumitaka 1987) Fumitaka et al., "Modified Quadratic Discriminant Function
and Application to Chinese Character Recognition", IEEE Trans. PAMI, pp149,
(Masters 1995) Timothy Masters, "Advanced Algorithms for Neural Networks",
John Wiley and Sons, New York, 1995.