NOTES FROM MLOCR WORKING GROUPS
Workshop participants divided themselves into three groups: i) Models,
ii) Tools and Information Retrieval, and iii) Corpus and Datasets.
They discussed various issues related to multilingual OCR and
multilingual OCR evaluation. Notes and recommendations from each group
are summarized below. 
 MODELS FOR X  Henry Baird, Moderator
 TOOLS, SYSTEMS, INFORMATION RETRIEVAL  Venu Govindaraju, Moderator
 DATASETS/CORPUS  Andrew Bagdanov, Moderator
 MODERATOR:   Henry Baird, Xerox PARC, USA 
 SCRIBE:  Kris Popat, Xerox PARC, USA 
 PARTICIPANTS:  
  - Mindy Bokser, Caere Corporation, CA, USA
  - Ken Cantwell, U.S. Department of Defense, USA
  - Tin Ho, Bell Labs, NJ, USA
  - R. K. Joshi, NCST, Bombay, INDIA
  - Markus Junker, DFKI, GERMANY
  - Sami Khoury, Department of National Defense, CANADA
  - Koichi Kise, Osaka Prefecture University, Osaka, JAPAN
  - S. P. Mudur, NCST, Bombay, INDIA
  - Vidyesh Nabar, Tata Infotech, Bombay, INDIA
  - Vadim Tereshchenko, ABBYY, Moscow, RUSSIA
  - Ben Wittner, Scansoft, MA, USA
TOPICS DISCUSSED:
-  Models of:
   -  Script
   -  Typefaces
   -  Language
   -  Typography and layout
   -  Image degradation
   -  Genre-specific layout
   -  Weak contexts; e.g., dates, culture

-  Types of Models:
   -  Probabilistic
       * parametric
       * non-parametric
   -  Deterministic
   -  Generative models
   -  Declarative (e.g., punctuation, regular expressions)
      vs. procedural (e.g., ad hoc, local)
   -  Assumptions vs. models
   -  Implicit, incomplete models

-  Language Models
   -  Lexica
   -  n-grams
   -  Syntax taggers
   -  Beyond lexica, char n-grams: higher-level, syntactic context
   -  Taggers: gender, number, case, ...

-  Models for Layout
   -  Specific to language, country
   -  Script models (not language), e.g., unbounded glyph sets

-  Mixing Models
   -  Interactions
   -  Difficult, fresh topic

-  Methodology and Theory
   -  Statistical methods for evaluating models
   -  Statistical methods for inferring models

-  Sharing and Collecting Models
   -  Resource list: contacts, experts in language and script
   -  Sharing, e.g., speech community
   -  Features
   -  Tools for training models; resource list
   -  Document structure models: physical and logical layout rules
      specific to language and culture
   -  Email lists; newsletters
 
RECOMMENDATIONS:
  -  Two papers we would like to read (i.e., we would like to see written):
     "Multilingual Layout Analysis"
     "Culture-specific Layout Rules"
  -  Share language models among academic participants
  -  Investigate ways in which knowledge from the domain
     of calligraphy can inform typeface modeling for OCR
  -  Create a resource list of contacts who are
     experts in language and script
  -  Use the speech recognition community as an example of
     how effective sharing of models and tools can be accomplished
  -  Consider putting together an email list and producing
     a newsletter, if interest warrants
  
 MODERATOR:  Venu Govindaraju, SUNY at Buffalo, USA  
 SCRIBE:  Atul Negi, University of Hyderabad, INDIA  
 PARTICIPANTS: 
-   Steve Dennis, U.S. Department of Defense, USA
-   Andy Downton, University of Essex, UK
-   Rinki Gupta, Tata Infotech, INDIA
-   Yang He, ScanSoft, Inc., USA
-   Fumitaka Kimura, MIE University, JAPAN
-   Tomohiko Morioka, ETL, JAPAN
-   Rohini Srihari, SUNY at Buffalo, USA
SUMMARY/RECOMMENDATIONS:                 
-  Data generation:
    -  Issues and challenges:
        -  Use of Microsoft Global IME for typesetting East Asian languages.
           See: http://www.microsoft.com/Windows/ie/Features/ime.asp
        -  Standardization of punctuation and diacritic marks
        -  Phonetic mapping
        -  Transliteration of multiple words, characters and ideographs

-  Environments and evaluation tools:
    -  Issues and challenges:
        -  Classes of users
        -  Desiderata of tools
        -  Design architectural network structure
    -  Recommendations:
        -  Keep broad view compatible with pattern recognition field
        -  Fix language pairs
        -  Distributed computation architecture

-  Information retrieval:
    -  Issues and challenges:
        -  Tagging -- levels of entities
        -  Language models
        -  Building tools and environments
        -  MUC-style truthing standards
        -  Quantity of data
    -  Recommendations:
        -  Fix domains: e.g., newspapers, memos, etc.
        -  Newspaper archives
 
 
 MODERATOR:  Andrew Bagdanov, University of Nevada, Las Vegas, USA
 SCRIBE:  David Doermann, University of Maryland, USA 
 PARTICIPANTS: 
  
  -  B. B. Chaudhuri, Indian Statistical Institute, INDIA
  -  Pankaj Doke, Tata Infotech, INDIA
  -  Robert M. Haralick, University of Washington, USA
  -  Tapas Kanungo, University of Maryland, USA
  -  Ihsin Phillips, University of Washington, USA
  -  Larry Spitz, Document Recognition, USA
  
RECOMMENDATIONS:
 -  Create groundtruthing specifications/guidelines document
     -  Adopt current standards: XML + Unicode
     -  Test specifications by converting UW Corpus into the new
        specification

 -  Yearly workshops for crystallizing data standards: possibly along with
    ICDAR, SPIE, and DAS conferences

 -  Create international consortium for organizing the creation and
    distribution of data

 -  Create basic tools for groundtruth creation according to new standards

 -  Disseminate data via the web, CDs

 -  Choose business model for creating and maintaining data
     -  Researchers pay in kind for the data by
         -  adding value
         -  adding new data
         -  providing quality assessment
         -  providing tools
     -  Researchers pay cash for the data

 -  Solicit federal funding (e.g., NSF/DARPA) for:
     -  workshop for creating dataset standards
     -  maintaining datasets
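
ILLUSTRATION (not from the working group): To make the "XML + Unicode"
recommendation above concrete, here is a minimal sketch of what a single
groundtruth record could look like when written and read back with Python's
standard library. The element and attribute names (page, zone, bbox, script)
and the sample values are assumptions chosen for illustration only. They are
not taken from the UW Corpus format or from any specification agreed at the
workshop.

```python
import xml.etree.ElementTree as ET


def make_page_record(page_id, zone_id, bbox, script, text):
    """Build a hypothetical <page> element holding one text <zone>.

    All element and attribute names are illustrative assumptions.
    """
    page = ET.Element("page", {"id": page_id})
    zone = ET.SubElement(page, "zone", {
        "id": zone_id,
        "bbox": ",".join(str(v) for v in bbox),  # x0,y0,x1,y1 in pixels
        "script": script,
    })
    zone.text = text  # transcription stored as Unicode text
    return page


def serialize(page):
    """Serialize the record to UTF-8 encoded XML bytes."""
    return ET.tostring(page, encoding="utf-8")


def parse_zones(xml_bytes):
    """Read back (zone id, script, text) triples from serialized XML."""
    root = ET.fromstring(xml_bytes)
    return [(z.get("id"), z.get("script"), z.text) for z in root.iter("zone")]


if __name__ == "__main__":
    record = make_page_record("p001", "z1", (10, 20, 300, 60),
                              "Devanagari", "नमस्ते")
    data = serialize(record)
    print(parse_zones(data))
```

Keeping the transcription as Unicode text inside plain XML means the same
record layout could carry any script, which is the point of adopting both
standards together.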
    
 
MLOCR website: http://www.cfar.umd.edu/~kanungo/workshop/mlocr.html