OCR Activity for Handwritten Manuscripts in Devanagari Script
-------------------------------------------------------------

Prof. R. K. JOSHI

National Centre for Software Technology
Gulmohar Cross Road No.9, Juhu, Mumbai 400 049 INDIA
rkjoshi@konark.ncst.ernet.in Tel +91 22 620 1606, 620 1574
FAX +91 22 621 0139 URL http://konark.ncst.ernet.in/~rkjosh

The various strokes produced by writing tools are to be compiled and
classified in a dictionary of strokes for the Devanagari script. The
writing tool used in a given manuscript is to be identified. From the root
primitives, branch primitives and tree primitives of each letter and their
compositions, a skeletal seed form is to be extracted. The folios of
a manuscript are to be scanned and binary data obtained through an OCR
programme. The boundary of each scanned letter is then to be extracted to
produce a central line, the core form of that letter. The core form of
each scanned letter is to be matched with the seed forms in the dictionary
of strokes, and the handwritten letter is thereby recognised by OCR
technology.

Keywords: Handwritten manuscripts, calligraphy, writing tools, primitives,
seedform, OCR scanning, boundary extraction, matching of core forms.

Introduction

OCR activity in India began in the early 1970s and has gained considerable
momentum during the last few years [1]. Machine recognition of characters
is being tackled mainly through the synthesis approach, the feature
extraction approach and the boundary extraction approach [2]. These have
been tried with a fair amount of success in the case of printed
letterforms of a few Indian scripts such as Devanagari, Bengali and
Malayalam.

Owing to the multilingual and multi-script environment, the challenges
posed to machine recognition techniques have not yet been fully realised
at the common root level of Indian letterforms and their shapes. A few
projects have succeeded well in character recognition of letterforms that
are already digitised. The real challenge is to draw up recognition
parameters for analog letterforms: the handprinted letterforms observed in
books printed using hot metal technology, and the handwritten letterforms
found in calligraphed manuscripts. This activity is significant because
many Indian texts of cultural importance are available in analog or
handwritten form, and they constitute traditional knowledge and ancient
Indian wisdom.

In multilingual India, 18 official languages and 13 official scripts [3]
exist, and some of them have been in use for verbal and textual
communication for over one thousand years. Further, these developed
scripts (around 1000 AD) can be traced back to the 3rd century BC, to the
Mauryan period and the Brahmi script, which was used to write texts in
Sanskrit, Pali and other Prakrit languages including Ardhamagadhi. The
shapes of the letters of the Indian alphabets have evolved over the last
2000 years.

The exhaustive sign list of the Devanagari script includes about 469
characters [4], which can produce about 5274 syllabic signs (Barahkhadi).
The less frequently used conjuncts (two or more consonants put together)
with the most complex structures have been given low priority in print
technology. In handwritten manuscripts, however, these complexities have
been handled with clever variations in order to achieve authenticity as
well as legibility. OCR technology will have to take these complexities
into account sooner or later in order to make culturally significant texts
readable by machines.
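
The figures quoted above can be checked against the breakdown given in
note [4]. The sketch below sums the listed categories and then tries one
decomposition that reproduces the syllabic counts. The paper does not say
how 5274 and 15822 are derived, so the two multiplication rules in the code
are assumptions that merely happen to match the quoted numbers.

```python
# Category counts exactly as listed in note [4] of this paper.
SIGN_COUNTS = {
    "standalone vowels": 18,
    "vowel matras": 18,
    "vowel modifiers": 4,
    "full consonants": 36,
    "half consonants": 36,
    "conjuncts": 257,
    "vedic signs": 60,
    "numerals": 10,
    "punctuation and other marks": 30,
}

total_signs = sum(SIGN_COUNTS.values())

# Assumption (not stated in the paper): every full consonant and every
# conjunct is combined with each of the 18 vowel matras.
base_forms = SIGN_COUNTS["full consonants"] + SIGN_COUNTS["conjuncts"]
syllabic_forms = base_forms * SIGN_COUNTS["vowel matras"]

# Assumption (not stated in the paper): the larger figure in note [4] is
# three times the syllabic count; which modifier variants make up that
# factor of three is not specified.
MODIFIER_FACTOR = 3
syllabic_forms_with_modifiers = syllabic_forms * MODIFIER_FACTOR

print(total_signs, syllabic_forms, syllabic_forms_with_modifiers)
```

If the actual derivation differs, only the two assumed rules need to
change; the category counts themselves come straight from note [4].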

Owing to the vast scope of the topic, this paper limits itself to the
handwritten manuscripts of the last 300 years or more in the collection of
the Asiatic Society of Bombay [5], written in the Devanagari script in
various calligraphic styles. These are mainly in the Sanskrit, Hindi and
Marathi languages. They will provide a different type of data on the
recognition factors of about 5000 handwritten letterforms in the
Devanagari script.

Objectives

1. To create (ground-truthed) data sets on various topics related to
the recognition factors of Devanagari, one of the major Indian scripts.

2. To create a large database of root primitives, branch primitives and
tree primitives, for each letterform and to determine their threshold
parametric identities.

3. To evolve a concept of the skeletal seedform for a given set of letters
of an alphabet. The possibility of common rules for the composition of such
seedforms is also to be explored.

4. To plan out a strategy to use the dictionary of seedforms of handwritten
Devanagari script to facilitate OCR activity.

Restrictions of Human Hand

Handwriting is a product of both a person and a pen. This product has
created some of the most beautiful calligraphic styles throughout the
world. Yet both of these elements have physical limitations while writing
a text. In order to reduce the infinite variations that exist in
handwritten text (writing being an individualistic and spatial/temporal
activity), the variations produced by the human hand and the writing tool
must be examined critically to establish why and how they occur. Through
such an analytical process we may arrive at a strategy to reduce the
number of variations observed in a text by identifying the restrictions
inherent in the very act of writing it.

Handwritten text is of two types: one is casual/informal and
individualistic, and the other is acquired/formal and stylistic
(calligraphy). In both varieties, the writing tool and its handling by the
human hand play an important role. This paper is concerned with the
acquired/formal text found in manuscripts written by professional
calligraphers and with its readability through the OCR procedure.

Identification of Tasks

Task 1

To identify the limitations of a human being in drawing a letter
(character).

Task 1.a

Limitations of motor skills associated with hand movements (drawing of
downward/upward strokes in a continuous or non-continuous way).

Task 1.b

Limitations of the writing tool (thickness of the tip, characteristics of
the tip, etc.).

Task 1.c

Limitations of a writing style (a calligraphic model followed in a
manuscript).

Task 2

To identify the range of variation humanly possible in drawing a letter
(character) and to collect the threshold parameters, using a parental
model for each letter.

Task 2.a

Variations in the height and width of a letter.

Task 2.b

Variations in the structural form of a letter.

Task 2.c

Variations in a segment of a letter.

Task 2.d

Variations in a letter due to the adjacent elements (a letter preceded
by a space or another letter, or a space followed by a letter, etc.).
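
As an illustration of Task 2.a, the threshold parameters for height and
width could be recorded as ratios relative to the parental model of a
letter. The sketch below is only one possible reading: the class and
function names are hypothetical, and taking the simple minimum and maximum
over measured samples is an assumption, not a method stated in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeThresholds:
    """Observed size range of a letter, relative to its parental model."""
    min_height_ratio: float
    max_height_ratio: float
    min_width_ratio: float
    max_width_ratio: float

    def accepts(self, height, width, parent_height, parent_width):
        """True if a measured letter falls inside the observed range."""
        h = height / parent_height
        w = width / parent_width
        return (self.min_height_ratio <= h <= self.max_height_ratio
                and self.min_width_ratio <= w <= self.max_width_ratio)


def size_thresholds(parent_size, samples):
    """Derive thresholds from (height, width) samples of one letter.

    parent_size is the (height, width) of the parental model.
    Assumption: the range is simply the min and max seen in the samples.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    parent_height, parent_width = parent_size
    heights = [h / parent_height for h, _ in samples]
    widths = [w / parent_width for _, w in samples]
    return SizeThresholds(min(heights), max(heights),
                          min(widths), max(widths))


# Hypothetical measurements (in pixels) of one letter from three folios.
thresholds = size_thresholds((100.0, 80.0),
                             [(90.0, 72.0), (110.0, 80.0), (100.0, 88.0)])
print(thresholds.accepts(95.0, 80.0, 100.0, 80.0))
```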

Task 3

To identify the seedform of each letter.

Task 3.a

To weed out ornamentation, elaborate details, extended serifs, etc.

Task 3.b

To define the process of assembling the bare essential segments to
form a seedform [6].

The Approach

STAGE I

Step 1

To identify the writing tool used in the manuscripts, including tool
specifications such as flat tip (width, thickness, angle), round tip
(diameter of the tip), flexible tip (pressure range), etc.
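
To show why the tool specification matters, the sketch below models a flat
tip as a rectangle and computes the thickness of the stroke it leaves when
moved in a given direction. This is plain geometry (the projection of the
rectangle onto the normal of the direction of travel) and is offered as an
illustration only; the class names are hypothetical, and a flexible tip,
whose width depends on pressure, is not modelled.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FlatTip:
    width: float      # long edge of the tip
    thickness: float  # short edge of the tip
    angle_deg: float  # angle of the long edge to the baseline

    def stroke_width(self, direction_deg):
        """Thickness of a stroke drawn in the given direction.

        The tip is treated as a rigid rectangle that does not rotate
        while the stroke is drawn (an idealisation).
        """
        delta = math.radians(direction_deg - self.angle_deg)
        return (self.width * abs(math.sin(delta))
                + self.thickness * abs(math.cos(delta)))


@dataclass(frozen=True)
class RoundTip:
    diameter: float

    def stroke_width(self, direction_deg):
        """A round tip leaves the same thickness in every direction."""
        return self.diameter


# Hypothetical reed pen: 2.0 units wide, 0.2 thick, held at 45 degrees.
pen = FlatTip(width=2.0, thickness=0.2, angle_deg=45.0)
for direction in (0.0, 45.0, 90.0, 135.0):
    print(direction, round(pen.stroke_width(direction), 3))
```

A stroke drawn along the edge of the tip comes out thinnest and a stroke
drawn across it comes out thickest, which is the thick and thin contrast
that lets the tool be inferred from the strokes in a manuscript.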

Step 2

To identify the root and branch primitives created by an identified
writing tool (a dot, straight line, angular line, curved line, and their
combinations) [7].

Step 3

To define the composition of one or more tree primitives that form a
letter, in relation to its zonal position (middle zone, upper zone and
lower zone, each further divided into four parts).
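
The zonal position can be made concrete with a small lookup. The sketch
below assumes that the upper and middle zones meet at the headline and
that the middle and lower zones meet at the baseline, and that each zone
is split into four equal horizontal bands. The paper does not specify how
the four parts are laid out, so the equal bands and all names here are
assumptions.

```python
def locate_zone(y, top, headline, baseline, bottom, parts=4):
    """Return (zone name, part index) for a vertical coordinate.

    Coordinates grow downward, as in a scanned image, so
    top < headline < baseline < bottom. Part indices run from 0
    (uppermost band of the zone) to parts - 1.
    """
    if not (top < headline < baseline < bottom):
        raise ValueError("zone boundaries must be strictly increasing")
    if not (top <= y < bottom):
        raise ValueError("y lies outside the letter's vertical extent")
    zones = (
        ("upper", top, headline),
        ("middle", headline, baseline),
        ("lower", baseline, bottom),
    )
    for name, start, end in zones:
        if start <= y < end:
            part = int((y - start) * parts / (end - start))
            # Guard against rounding pushing the index past the last band.
            return name, min(part, parts - 1)
    raise AssertionError("unreachable: y was range-checked above")


# Hypothetical line geometry in pixels.
print(locate_zone(5, top=0, headline=20, baseline=60, bottom=80))
```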

Step 4

To extract a seedform from the composition of the letter by identifying
its skeletal path - the seed structure.

This creates a dictionary of over five thousand seedforms of Devanagari
letters, some of which may be found in a given manuscript written in a
particular calligraphic style. This becomes the preprocessing stage of the
OCR activity.

STAGE II

Step 1

To scan a manuscript folio (page) through the text digitization process
at the intended dpi, creating binary data [8] for the folio.
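
A minimal sketch of how binary data might be produced from a greyscale
scan is given below. It uses a single fixed threshold and assumes that ink
is darker than the writing surface; both are simplifying assumptions, and
an aged or stained folio would likely need an adaptive threshold instead.

```python
def binarize(gray, threshold=128):
    """Convert a greyscale image to binary data.

    gray is a list of rows of intensities in the range 0..255.
    Returns rows of 1 (ink) and 0 (background), assuming that ink is
    darker than the threshold.
    """
    return [[1 if value < threshold else 0 for value in row]
            for row in gray]


# Hypothetical 2 x 2 patch of a scanned folio.
patch = [[250, 30],
         [120, 200]]
print(binarize(patch))
```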

Step 2

The further processes of noise cleaning, skew correction, line
segmentation, word segmentation, zone detection and character
segmentation are then to be carried out as per the OCR programme [9].
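
Of the processes listed above, line segmentation is easy to illustrate.
The sketch below uses a horizontal projection profile, a common technique
for separating text lines; the paper does not say which method the OCR
programme uses, so this is a stand-in under that assumption. It expects
binary data with 1 for ink and 0 for background.

```python
def segment_lines(binary, min_ink=1):
    """Split a binary page into text lines.

    Returns a list of (start_row, end_row) pairs, end exclusive. A row
    belongs to a line when it holds at least min_ink ink pixels.
    """
    profile = [sum(row) for row in binary]  # ink pixels per row
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                       # a line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))        # a line ends at a blank row
            start = None
    if start is not None:                   # line runs to the page edge
        lines.append((start, len(profile)))
    return lines


# Hypothetical page with two short lines separated by a blank row.
page = [[0, 0, 0],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [1, 0, 0]]
print(segment_lines(page))
```

The same profile taken over columns within one line gives a first cut at
word segmentation, though the continuous headline of Devanagari words
means character segmentation needs more than a column profile.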

Step 3

The boundary outline of each character is to be extracted [10]. The
central line of the boundary outline is then to be identified as the core
form of the scanned character.
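
The two operations in this step can be sketched on binary data as follows.
The boundary is taken as every ink pixel with a background pixel among its
four direct neighbours. For the central line, the paper names no
algorithm, so Zhang-Suen thinning is used here as a stand-in; it reduces a
stroke to a thin line near its centre. Function names are hypothetical.

```python
def boundary(binary):
    """Mark ink pixels that touch background on any of their four sides."""
    h, w = len(binary), len(binary[0])

    def ink(y, x):
        return 0 <= y < h and 0 <= x < w and binary[y][x] == 1

    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            enclosed = (ink(y - 1, x) and ink(y + 1, x)
                        and ink(y, x - 1) and ink(y, x + 1))
            if binary[y][x] == 1 and not enclosed:
                out[y][x] = 1
    return out


def thin(binary):
    """Zhang-Suen thinning; returns a new image, the input is unchanged."""
    h, w = len(binary), len(binary[0])
    # Pad with a one-pixel background border so neighbours always exist.
    img = ([[0] * (w + 2)]
           + [[0] + list(row) + [0] for row in binary]
           + [[0] * (w + 2)])
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for y in range(1, h + 1):
                for x in range(1, w + 1):
                    if img[y][x] == 0:
                        continue
                    # Neighbours clockwise from north: P2 .. P9.
                    n = [img[y - 1][x], img[y - 1][x + 1], img[y][x + 1],
                         img[y + 1][x + 1], img[y + 1][x], img[y + 1][x - 1],
                         img[y][x - 1], img[y - 1][x - 1]]
                    p2, _, p4, _, p6, _, p8, _ = n
                    count = sum(n)
                    # Number of 0 -> 1 transitions around the pixel.
                    transitions = sum(
                        1 for i in range(8)
                        if n[i] == 0 and n[(i + 1) % 8] == 1)
                    if not (2 <= count <= 6 and transitions == 1):
                        continue
                    if step == 0:
                        removable = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        removable = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if removable:
                        to_clear.append((y, x))
            # Clear after the scan so each sub-step sees one fixed image.
            for y, x in to_clear:
                img[y][x] = 0
            if to_clear:
                changed = True
    return [row[1:-1] for row in img[1:-1]]


# Hypothetical thick horizontal stroke, 3 pixels high and 7 wide.
bar = [[1] * 7 for _ in range(3)]
for row in thin(bar):
    print(row)
```

Thinning works on the filled shape rather than on the outline alone, so
it approximates, rather than literally computes, the central line of the
boundary outline described above.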

STAGE III

The seedform of a character as identified in Stage I (an outcome of a
writing tool) is to be matched with the core form of the scanned character
through a search-and-match process over the dictionary created in Stage I.
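
A minimal sketch of such a search-and-match process follows. It assumes
that the core form and every seedform have already been normalised to the
same grid, and it scores each pair by the Jaccard overlap of their ink
pixels. The paper does not prescribe a similarity measure, so the measure,
the labels and the function names are all assumptions.

```python
def to_points(grid):
    """Set of (row, column) positions that hold ink."""
    return {(y, x)
            for y, row in enumerate(grid)
            for x, value in enumerate(row) if value}


def jaccard(a, b):
    """Overlap of two point sets: 1.0 when identical, 0.0 when disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(core_form, seed_dictionary):
    """Return (label, score) of the seedform closest to the core form."""
    if not seed_dictionary:
        raise ValueError("the seedform dictionary is empty")
    core_points = to_points(core_form)
    scored = [(jaccard(core_points, to_points(seed)), label)
              for label, seed in seed_dictionary.items()]
    score, label = max(scored)
    return label, score


# Hypothetical two-entry dictionary on a 3 x 3 grid.
seeds = {
    "vertical bar": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "horizontal bar": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
}
scanned = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]
print(best_match(scanned, seeds))
```

Pixel overlap is sensitive to small shifts, so a practical matcher would
probably compare structural features of the primitives instead; the sketch
only shows the shape of the dictionary lookup.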

Conclusion

It is observed that a preprocessing stage in OCR activity is avoided as
far as possible. But in the case of handwritten text, owing to its
infinitely varied possible structures, it is suggested that a different
approach, that of identifying and analysing the writing tool, be included
as a preprocessing stage of OCR activity for manuscripts. This will
help in

1. creating a dataset of writing tools, including the limitations of a
writing tool in executing a specific writing style.

2. creating a large database of root primitives, branch primitives and
tree primitives for each letter (character) of Indian scripts, including
threshold parametric identities.

It is suggested that these valuable datasets can be further extended to
undertake the OCR activity of informal handwriting as well as printed
material (analog) produced through hot metal technology.

POSTSCRIPT

Revised Tasks Identified, in order to meet the revised scope:

Task A: To identify and strip off the ornamental/stylistic features of
a calligraphic style as observed in a given manuscript (set of folios)
and to create a database of such excesses/details, as related to the
specific writing tool used in the manuscript. Through such a process the
essential core form of a calligraphed letter (in Devanagari script) can be
extracted. Such core forms will be further dovetailed into the existing
OCR programme/s for recognition of printed Devanagari letterforms.

It is hoped that a manuscript written in a specific Jain calligraphic
style in Devanagari script can be treated as prototype material for such
work.

Task B: To identify the variation range of a letter (casual, non-formal,
non-calligraphic) as drawn by an individual, with its orthographic nuances
including angle, size, tone, joinery elements, etc., and through such a
process to create a database of such noises, which will help to draw
threshold parameters for a given letter so as to arrive at its core form.

References and Acknowledgement:

1. Early work on speech synthesis and perception/pattern recognition by
R. Narasimhan, V. S. N. Reddy and P. V. S. Rao at Tata Institute of
Fundamental Research, Mumbai, India, 1974-1978.
Recent work of R. M. K. Sinha, B. B. Chaudhury, U. Pal, J. C. Sant,
S. K. Mallick, P. S. Janardhanan et al.

2. B. B. Chaudhury, U. Pal, Recognition of printed Bangala script.
P. S. Janardhanan. Issues in the development of OCR systems for
Dravidian languages - Proceedings of Akshara 94., BPB Publications,
New Delhi, India 1994.

3. 18 Languages include 15 scheduled (VIII schedule of the Indian
Constitution) languages viz. Kannada, Malayalam, Tamil, Telugu,
Gujarati, Marathi, Assamese, Bengali, Oriya, Hindi, Punjabi, Sanskrit,
Kashmiri, Sindhi, Urdu and 3 newly added languages viz. Konkani,
Manipuri, Nepali.
13 scripts include Kannada, Malayalam, Tamil, Telugu, Gujarati,
Devanagari, Assamese, Bengali, Oriya, Gurumukhi, Kashmiri, Sindhi, Urdu.

4. Sign lists of the Devanagari script can be identified at the total set
level and at subset levels. The total set of about 469 characters includes
18 standalone vowels, 18 vowel matras, 4 vowel modifiers, 36 full
consonants, 36 half consonants, 257 conjuncts, 60 Vedic signs for Vedic
Sanskrit, 10 numerals and 30 punctuation/other marks. By applying
Barahkhadi principles (combination of a pure consonant and a vowel), the
syllabic combinations work out to 5274, and with vowel modifiers the
total set of syllabic combinations works out to 15822. However, the
subsets required for particular tasks and applications can be smaller
and can vary in size. For example, for OCR activity with a 3-tier
zonal identification approach, the signs to be parametrised are as
follows: the OCR total set contains 469 characters (to be considered in
the case of handwritten manuscripts); OCR subsets can contain 217
characters (to be considered for printed books, analog).

5. H. D. Velankar, A descriptive catalogue of Sanskrit and Prakrit
manuscripts in the collection of the Asiatic Society of Bombay,
second edition 1998, edited by V. M. Kulkarni & Devangana Desai.

6. R. B. Velankar, The catalogue of manuscripts at Asiatic Society of Bombay.

7. Vinyas : Digital type font design system by National Centre for
Software Technology, Mumbai, India 1986.

8. Joshi R. K. - Aarekhan : a grammar of strokes, printed and published
by Industrial Design Centre, IIT, Mumbai 1992.

9. Fred W. H. Stentiford : Automatic Feature Design for OCR using an
Evolutionary Search Procedure, IEEE 1985 p349-355.

10. U. Pal and B. B. Chaudhury, Printed Devanagari Script OCR System,
Vivek Vol.10 No.1 Jan 1997 pp12-23, NCST, Mumbai, India.

11. P. S. Janardhanan, Optical Character Recognition in Using Fourier
Descriptors, ERDC.92, Proceedings of the third Annual Seminar,
Trivandrum 1992.

Thanks to Dr. S. P. Mudur and Dinesh Shikhare of National Centre for
Software Technology, Mumbai for their comments and help.
