OCR Activity for Handwritten Manuscripts in Devanagari Script
      -------------------------------------------------------------
	                Prof. R. K. JOSHI
	    National Centre for Software Technology
      Gulmohar Cross Road No.9, Juhu, Mumbai 400 049 INDIA
    rkjoshi@konark.ncst.ernet.in  Tel +91 22 620 1606,  620 1574
    FAX +91 22 621 0139  URL http://konark.ncst.ernet.in/~rkjosh
  The various strokes created by writing tools are to be compiled and 
classified in a dictionary of strokes for the Devanagari script.  The 
writing tool used in a given manuscript is to be identified.  From the 
root primitives, branch primitives and tree primitives of each letter and 
their compositions, a skeletal seedform is to be extracted.  The folios of 
a manuscript are to be scanned and binary data obtained through an OCR 
programme.  Further, the boundary of each letter is to be extracted to 
produce a central line, the core form of that letter.  The core form of 
each scanned letter is then to be matched with the seedforms from the 
dictionary of strokes, and thus a handwritten letter is to be recognised 
by OCR technology.
Keywords: Handwritten manuscripts, calligraphy, writing tools, primitives, 
seedform, OCR scanning, boundary extraction, matching of coreforms.
Introduction
   OCR activity on the Indian scene was initiated in the early 1970s and 
has gained considerable momentum during the last few years [1].  The 
issues related to character recognition by machine are being tackled 
mainly through the synthesis approach, the feature extraction approach 
and the boundary extraction approach [2].  These have been tried out with 
a fair amount of success in the case of printed letterforms of a few 
Indian scripts such as Devanagari, Bengali, Malayalam, etc.
   Due to the multilingual and multiscript environment, the challenges 
posed for machine recognition techniques have not yet been fully realised 
at a common root level of Indian letterforms and their shapes.  A few 
projects have succeeded well in character recognition of letterforms 
already digitised.  The real challenge is to draw up recognition 
parameters for analog letterforms: the handprinted letterforms observed 
in books printed using hot metal technology and the handwritten 
letterforms found in calligraphed manuscripts.  This activity is 
significant since many Indian texts of cultural importance are available 
in analog or handwritten form, and they constitute the traditional 
knowledge and ancient Indian wisdom.
   In multilingual India, 18 official languages and 13 official scripts [3]
exist, and some of them have been in use for verbal and textual 
communication for over one thousand years.  Further, these developed 
scripts (around 1000 AD) can be traced back to the 3rd century BC, to the 
Mauryan period and the Brahmi script, which was used to write text in 
Sanskrit, Pali and other Prakrit languages including Ardhamagadhi.  
The shapes of the letters of the Indian alphabets have gone through 
evolutionary processes for the last 2000 years.
   The exhaustive sign list of the Devanagari script would include about 
469 characters [4], which can produce syllabic versions (Barahkhadi) of 
about 5274 signs.  The less frequently used conjuncts (two or more 
consonants put together) with the most complex structures have been given 
lower priority in print technology.  However, in handwritten manuscripts 
these complexities have been handled with clever variations in order to 
achieve authenticity as well as legibility.  OCR technology will have to 
take these complexities into account sooner or later in order to make 
culturally significant text readable by machines.
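
The counts quoted above can be cross-checked against the breakdown given 
in note [4].  The short Python sketch below only re-adds the figures from 
that note; the dictionary keys are labels chosen here for illustration, 
and the observation that 15822 is three times 5274 is plain arithmetic on 
the quoted totals, not a derivation of how they were obtained.

```python
# Sign counts exactly as listed in note [4]; the key names are illustrative labels.
SIGN_GROUPS = {
    "standalone vowels": 18,
    "vowel matras": 18,
    "vowel modifiers": 4,
    "full consonants": 36,
    "half consonants": 36,
    "conjuncts": 257,
    "vedic signs": 60,
    "numerals": 10,
    "punctuation and other marks": 30,
}

total_signs = sum(SIGN_GROUPS.values())

# Syllabic totals quoted in note [4] (not recomputed here).
SYLLABIC = 5274
SYLLABIC_WITH_MODIFIERS = 15822
ratio = SYLLABIC_WITH_MODIFIERS / SYLLABIC

print(total_signs)  # 469
print(ratio)        # 3.0
```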
   Due to the vast scope of the topic, this paper limits itself to the 
handwritten manuscripts of the last 300 years or more from the collection 
of the Asiatic Society of Bombay [5], written in Devanagari script in 
various calligraphic styles.  These are mainly in the Sanskrit, Hindi and 
Marathi languages.  This will provide a different type of data on the 
recognition factors of about 5000 handwritten letterforms in Devanagari 
script.
Objectives
1. To create (ground-truthed) data sets on various topics related to 
the recognition factors of Devanagari, one of the major Indian scripts.
2. To create a large database of root primitives, branch primitives and 
tree primitives, for each letterform and to determine their threshold 
parametric identities.
3. To evolve a concept of the skeletal seedform of a given set of letters 
of an alphabet.  The possibility of common rules for the composition of 
such seedforms is also to be explored.
4. To plan out a strategy to use the dictionary of seedforms of handwritten 
Devanagari script to facilitate OCR activity.
Restrictions of Human Hand
   Handwriting is a product of both a person and a pen.  This product has 
created some of the most beautiful calligraphic styles throughout the 
world.  Yet it is observed that both these elements have physical 
limitations while writing a text.  In order to reduce the infinite 
variations that exist in handwritten text (writing being an 
individualistic and spatial/temporal activity), the variations produced 
by the human hand and the writing tool must be examined critically to 
see why and how they happen.  Through such an analytical process we may 
arrive at a strategy to reduce the number of variations observed in a 
text by identifying the restrictions in the very act of writing it.
   Handwritten text is of two types: one is the casual/informal and 
individualistic variety, and the other is the acquired/formal and 
stylistic variety (calligraphy).  In both varieties, the writing tool and 
its handling by the human hand play an important role.  This paper is 
concerned with the acquired/formal text found in manuscripts written by 
professional calligraphers and its readability through the OCR procedure.
 
Identification of Tasks
Task 1
To identify limitations of human being in drawing a letter (character).
Task 1.a
Limitations of motor skills associated with hand movements (drawing of 
downward/upward strokes in a continuous or non-continuous way).
Task 1.b
Limitations of writing tool (thickness of the tip, characteristic of 
the tip etc.)
Task 1.c
Limitations of a writing style (a calligraphic model followed in a 
manuscript).
Task 2
To identify the range of variation humanly possible in drawing a letter 
(character) and collect the threshold parameters, using a parental model 
for each letter.
Task 2.a
Variations in the height and width of a letter.
Task 2.b
Variations in the structural form of a letter.
Task 2.c
Variations in a segment  of a letter.
Task 2.d
Variations in a letter due to the adjacent elements.  (A letter preceded 
by a space or another letter or a space followed by a letter etc.).
Task 3
To identify the seedform of each letter.
Task 3.a
To weed out ornamentation, elaborate details and extended serifs etc.
Task 3.b
To define the process of assembling bare essential segments to 
formulate a seedform.[6]
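
As an illustration of the threshold parameters mentioned in Task 2, the 
sketch below derives an allowed range for one measurement (here the 
width/height ratio of a letter, cf. Task 2.a) from a handful of samples.  
The sample values, the `margin` parameter and the names `Range` and 
`threshold_range` are assumptions made for this example only; the paper 
does not prescribe how the thresholds are to be computed.

```python
from dataclasses import dataclass


@dataclass
class Range:
    """Closed interval of acceptable values for one measured parameter."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def threshold_range(samples, margin=0.0):
    """Observed min/max of a measurement, widened on both sides by `margin`."""
    return Range(min(samples) - margin, max(samples) + margin)


# Hypothetical width/height ratios of one letter measured across a folio.
ratios = [0.82, 0.79, 0.88, 0.85]
allowed = threshold_range(ratios, margin=0.02)

print(allowed.contains(0.80))  # True: within the observed variation
print(allowed.contains(1.10))  # False: outside the threshold range
```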
The Approach
STAGE I
Step 1
To identify the writing tool used in the manuscripts, including the 
tool specifications such as flat tip (width, thickness, angle), round tip 
(diameter of the tip), flexible tip (pressure range) etc.
Step 2
To identify the root and branch primitives created by an identified 
writing tool (a dot, straight line, angular line, curved line, and their 
combinations). [7]
Step 3
To define the composition of tree primitives to form a letter, in 
relation to its zonal position (middle zone, upper zone and lower zone, 
further divided into four parts each).
Step 4
To extract a seedform from the composition of the letter by identifying 
its skeletal path - the seed structure.
This creates a dictionary of over five thousand seedforms of Devanagari 
letters.  Some of these may be found in a given manuscript written in a 
particular calligraphic style.  This becomes a preprocessing stage of the 
OCR activity.
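
One possible way to organise the outcome of Stage I is sketched below in 
Python.  It is a minimal data model only: the class names, field names and 
the sample values (a flat tip of 2 mm held at 45 degrees, the primitives 
listed for the letter) are assumptions made for illustration and are not 
taken from any manuscript or from an existing NCST program.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class WritingTool:
    """Step 1: tool specification (frozen so it can be used as a dictionary key)."""
    kind: str               # "flat", "round" or "flexible"
    width_mm: float = 0.0   # tip width (flat) or diameter (round)
    angle_deg: float = 0.0  # pen angle, meaningful for flat tips


@dataclass
class Primitive:
    """Steps 2 and 3: one stroke primitive and the zone it occupies."""
    level: str  # "root", "branch" or "tree"
    shape: str  # "dot", "straight", "angular" or "curve"
    zone: str   # "upper", "middle" or "lower"


@dataclass
class SeedForm:
    """Step 4: skeletal path of a letter, stored as (row, column) points."""
    letter: str
    tool: WritingTool
    primitives: List[Primitive] = field(default_factory=list)
    skeleton: List[Tuple[int, int]] = field(default_factory=list)


class StrokeDictionary:
    """Seedforms grouped by the writing tool that produced them."""

    def __init__(self) -> None:
        self._entries: Dict[WritingTool, Dict[str, SeedForm]] = {}

    def add(self, seed: SeedForm) -> None:
        self._entries.setdefault(seed.tool, {})[seed.letter] = seed

    def for_tool(self, tool: WritingTool) -> Dict[str, SeedForm]:
        return dict(self._entries.get(tool, {}))


# Hypothetical entry: the letter "ka" written with a flat-tipped pen.
reed_pen = WritingTool(kind="flat", width_mm=2.0, angle_deg=45.0)
ka = SeedForm(
    letter="क",
    tool=reed_pen,
    primitives=[
        Primitive("root", "straight", "middle"),   # vertical stem
        Primitive("branch", "curve", "middle"),    # loop attached to the stem
        Primitive("branch", "straight", "upper"),  # head line
    ],
)
book = StrokeDictionary()
book.add(ka)
print(sorted(book.for_tool(reed_pen)))
```

Grouping the entries by tool reflects the idea that, once the tool of a 
manuscript has been identified, only the seedforms produced by that tool 
need to be searched.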
STAGE II
Step 1
Scanning of a manuscript folio (page) through the text digitization 
process at the intended dpi, and creating binary data [8] of the folio.
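
The creation of binary data can be illustrated with a fixed global 
threshold, as in the Python sketch below.  The threshold of 128 and the 
convention 1 = ink, 0 = paper are assumptions chosen for the example; aged 
or stained folios would probably need an adaptive threshold instead.

```python
def binarize(gray, threshold=128):
    """Map a grey-level image (rows of 0-255 integers) to 1 = ink, 0 = paper."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


# Hypothetical 2 x 3 grey-level patch: dark values are ink on light paper.
scan = [
    [250, 40, 245],
    [30, 20, 240],
]
binary = binarize(scan)
print(binary)  # [[0, 1, 0], [1, 1, 0]]
```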
Step 2
The further processes of noise cleaning, skew correction, line 
segmentation, word segmentation, zone detection, character segmentation 
are to be taken up as per OCR programming. [9]
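
Of the processes listed in Step 2, line segmentation is the simplest to 
sketch.  The Python example below uses a horizontal projection profile, 
one common technique and not necessarily the one used in the OCR program 
referred to in [9]: rows that contain no ink are treated as gaps between 
text lines.  The function name and the tiny test page are illustrative, 
and the page is assumed to be binarised and skew-corrected already.

```python
def segment_lines(binary):
    """Return (start_row, end_row) pairs, end exclusive, for runs of inked rows."""
    profile = [sum(row) for row in binary]  # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink and start is None:
            start = y                 # a text line begins
        elif not ink and start is not None:
            lines.append((start, y))  # a blank row closes the line
            start = None
    if start is not None:             # line running to the bottom edge
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```

The same profile taken column by column gives a first approximation of 
word segmentation.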
Step 3
The boundary outline extraction of each character is to be undertaken. [10]  
The central line of the boundary outline is to be identified as a core 
form.  Thus the core form of the scanned character is to be obtained.  
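
The two operations of Step 3 can be sketched as follows.  The paper does 
not name an algorithm for finding the central line; the example below 
substitutes Zhang-Suen thinning, a standard skeletonisation method, and 
uses a simple 4-neighbour test for the boundary.  Function names and the 
test shape are illustrative.

```python
def boundary(binary):
    """Ink pixels that touch paper (or the image edge) in a 4-neighbourhood."""
    h, w = len(binary), len(binary[0])

    def ink(y, x):
        return 0 <= y < h and 0 <= x < w and binary[y][x] == 1

    steps = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return {(y, x) for y in range(h) for x in range(w)
            if ink(y, x) and not all(ink(y + dy, x + dx) for dy, dx in steps)}


def thin(binary):
    """Zhang-Suen thinning: peel outline pixels until a thin central line remains."""
    img = [row[:] for row in binary]
    h, w = len(img), len(img[0])

    def neighbours(y, x):
        # P2..P9, clockwise starting from north; outside the image counts as paper.
        coords = [(y - 1, x), (y - 1, x + 1), (y, x + 1), (y + 1, x + 1),
                  (y + 1, x), (y + 1, x - 1), (y, x - 1), (y - 1, x - 1)]
        return [img[j][i] if 0 <= j < h and 0 <= i < w else 0 for j, i in coords]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for y in range(h):
                for x in range(w):
                    if img[y][x] != 1:
                        continue
                    p = neighbours(y, x)
                    ink_count = sum(p)
                    # number of paper-to-ink transitions around the pixel
                    transitions = sum(p[k] == 0 and p[(k + 1) % 8] == 1
                                      for k in range(8))
                    north, east, south, west = p[0], p[2], p[4], p[6]
                    if step == 0:
                        ok = north * east * south == 0 and east * south * west == 0
                    else:
                        ok = north * east * west == 0 and north * south * west == 0
                    if 2 <= ink_count <= 6 and transitions == 1 and ok:
                        to_clear.append((y, x))
            for y, x in to_clear:  # clear only after the full pass
                img[y][x] = 0
            changed = changed or bool(to_clear)
    return img


# A thick horizontal stroke, three pixels high, on a 5 x 9 patch.
stroke = [[1 if 1 <= y <= 3 and 1 <= x <= 7 else 0 for x in range(9)]
          for y in range(5)]
outline = boundary(stroke)
core = thin(stroke)
for row in core:
    print("".join("#" if v else "." for v in row))
```

Thinning shortens the ends of a stroke slightly, so a core form obtained 
this way would have to be compared with seedforms that allow for that.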
STAGE III
The seedform of a character as identified in Stage I (an outcome of a 
writing tool) is to be matched with the core form of the scanned 
character through a search and match process against the dictionary 
created in Stage I.
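
A very simple form of this search and match process is sketched below in 
Python.  Both the core form and the seedforms are taken to be sets of 
(row, column) points; each is scaled to a common grid and compared by the 
proportion of grid cells they share (Jaccard similarity).  The measure, 
the grid size of 8 and the two toy seedforms are assumptions for 
illustration only; a practical matcher would need to tolerate the 
variations listed under Task 2.

```python
def normalise(pixels, size=8):
    """Shift a set of (row, col) points to the origin and scale to a size x size grid."""
    rows = [y for y, _ in pixels]
    cols = [x for _, x in pixels]
    y0, x0 = min(rows), min(cols)
    span = max(max(rows) - y0, max(cols) - x0, 1)  # keep the aspect ratio
    scale = (size - 1) / span
    return {(round((y - y0) * scale), round((x - x0) * scale)) for y, x in pixels}


def similarity(core, seed):
    """Jaccard similarity of two point sets after normalisation (0.0 to 1.0)."""
    a, b = normalise(core), normalise(seed)
    return len(a & b) / len(a | b)


def best_match(core, seeds):
    """Name of the seedform in `seeds` that is most similar to `core`."""
    return max(seeds, key=lambda name: similarity(core, seeds[name]))


# Toy dictionary with two hypothetical seedforms.
seeds = {
    "vertical": {(y, 0) for y in range(5)},
    "horizontal": {(0, x) for x in range(5)},
}
# Core form of a scanned stroke: a longer vertical line elsewhere on the folio.
core = {(y, 3) for y in range(2, 12)}
print(best_match(core, seeds))  # vertical
```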
Conclusion
It is observed that a preprocessing stage in OCR activity is avoided as 
far as possible.  But in the case of handwritten text, with its 
infinitely varied possible structures, it is suggested that a different 
approach, that of identifying and analysing the writing tool, be included 
as a preprocessing stage of OCR activity for manuscripts.  This will 
help in
1. creating a dataset of writing tools, including the limitations of a 
writing tool in executing a specific writing style;
2. creating a large database of root primitives, branch primitives and 
tree primitives for each letter (character) of Indian scripts, including 
threshold parametric identities.
It is suggested that these valuable datasets can be further extended to 
undertake the OCR activity of informal handwriting as well as printed 
material (analog) produced through hot metal technology.
POSTSCRIPT
Revised Tasks Identified, in order to meet the revised scope:
Task A: To identify and strip off the ornamental/stylistic features of
a calligraphic style as observed in a given manuscript (set of folios)
and create a database of such excesses/details, as related to the specific
writing tool used in the manuscript.  Through such a process the essential
core form of a calligraphed letter (in Devanagari script) can be extracted.
Such core forms will be further dovetailed into the existing OCR
programmes for recognition of Devanagari printed letterforms.
It is hoped that a manuscript written in a specific Jain calligraphic style
in Devanagari script can be treated as prototype material for such work.
Task B: To identify the variation range of a letter (casual, non-formal,
non-calligraphic) as drawn by an individual, with its orthographic nuances
including angle, size, tone, joinery elements etc., and through such a
process create a database of such noises, which will help to draw up
threshold parameters for a given letter so as to arrive at its core form.
References and Acknowledgement:
1. Early work on speech synthesis and perception/pattern recognition by 
   R. Narasimhan, V. S. N. Reddy and P. V. S. Rao at Tata Institute of 
   Fundamental Research, Mumbai, India 1974 -1978.  
   Recent work of R. M. K. Sinha, B. B. Chaudhury, U. Pal, J. C. Sant, 
   S. K. Mallick, P. S. Janardhanan et al.
2. B. B. Chaudhury, U. Pal, Recognition of printed Bangala script.             
   P. S. Janardhanan.  Issues in the development of OCR  systems for 
   Dravidian languages - Proceedings of Akshara 94., BPB Publications, 
   New Delhi, India 1994.
3. 18 Languages include 15 scheduled (VIII schedule of the Indian 
   Constitution) languages viz. Kannada, Malayalam, Tamil,  Telugu, 
   Gujarati, Marathi, Assamese, Bengali, Oriya, Hindi, Punjabi, Sanskrit, 
   Kashmiri, Sindhi, Urdu and 3 newly added languages viz. Konkani, 
   Manipuri, Nepali.
   13 scripts include Kannada, Malayalam, Tamil, Telugu, Gujarati, 
   Devanagari, Assamese, Bengali, Oriya, Gurumukhi,  Kashmiri, Sindhi, Urdu.
4. Sign lists of Devanagari script can be identified at total set level 
   and subset levels.  The total set (about 469 characters) includes 
   18 standalone vowels, 18 vowel matras, 4 vowel modifiers, 36 full 
   consonants, 36 half consonants, 257 conjuncts, 60 vedic signs for vedic 
   Sanskrit, 10 numerals and 30 punctuation/other marks.  By applying 
   Barahkhadi principles (combination of pure consonant and vowel), 
   the syllabic combinations work out to be 5274, and with vowel modifiers 
   the total set of syllabic combinations works out to be 15822.  However, 
   subsets required for certain tasks and application needs can be reduced 
   and can vary in quantity.  For example, for OCR activity with the 3 tier 
   zonal identification approach, the signs to be parametrised will be 
   as follows:  the OCR total set contains 469 characters (to be considered 
   in case of handwritten manuscripts); OCR subsets can contain 217 
   characters (to be considered in case of printed books - analog).
5. H. D. Velankar, A descriptive catalogue of Sanskrit and Prakrit 
   manuscripts in the collection of the Asiatic Society of Bombay, 
   second edition 1998 edited by V. M. Kulkarni & Devangana Desai.
6. R. B. Velankar, The catalogue of manuscripts at Asiatic Society of Bombay.
7. Vinyas : Digital type font design system by National Centre for 
   Software Technology, Mumbai, India 1986.
8. Joshi R. K. - Aarekhan : a grammar of strokes, printed and published 
   by Industrial Design Centre, IIT, Mumbai 1992.
9. Fred W. H. Stentiford : Automatic Feature Design for OCR using an 
   Evolutionary Search Procedure, IEEE 1985 p349-355.
10. U. Pal and B. B. Chaudhury, Printed Devanagari Script OCR System, 
    Vivek Vol.10 No.1 Jan 1997 pp12-23, NCST,    Mumbai, India.
11. P. S. Janardhanan, Optical Character Recognition in Using Fourier
    Descriptors, ERDC.92, Proceedings of the third Annual Seminar,
    Trivandrum 1992.
Thanks to Dr. S. P. Mudur and Dinesh Shikhare of National Centre for
Software Technology, Mumbai for their comments and help.