NOTES FROM MLOCR WORKING GROUPS

Workshop participants divided themselves into three groups: i) Models, ii) Tools and Information Retrieval, and iii) Corpus and Datasets. They discussed various issues related to multilingual OCR and multilingual OCR evaluation. Notes and recommendations from each group are summarized below.

MODELS FOR X Henry Baird, Moderator
TOOLS, SYSTEMS, INFORMATION RETRIEVAL Venu Govindaraju, Moderator
CORPUS/DATASETS Andrew Bagdanov, Moderator

MODELS FOR X

MODERATOR: Henry Baird, Xerox PARC, USA
SCRIBE: Kris Popat, Xerox PARC, USA
PARTICIPANTS:

Mindy Bokser, Caere Corporation, CA, USA
Ken Cantwell, U.S. Department of Defense, USA
Tin Ho, Bell Labs, NJ, USA
R. K. Joshi, NCST, Bombay, INDIA
Markus Junker, DFKI, GERMANY
Sami Khoury, Department of National Defense, CANADA
Koichi Kise, Osaka Prefecture University, Osaka, JAPAN
S. P. Mudur, NCST, Bombay, INDIA
Vidyesh Nabar, Tata Infotech, Bombay, INDIA
Vadim Tereshchenko, ABBYY, Moscow, RUSSIA
Ben Wittner, ScanSoft, MA, USA

TOPICS DISCUSSED:

Models of:
- Script
- Typefaces
- Language
- Typography and layout
- Image Degradation
- Genre-specific layout
- Weak contexts; e.g., dates, culture
Types of Models:
- Probabilistic
  - parametric
  - non-parametric
- Deterministic
- Generative models
- Declarative (e.g. punctuation, regular expression) vs. Procedural (e.g., ad hoc, local)
- Assumptions vs. models
- Implicit, incomplete models
Language Models
- Lexica
- n-grams
- Syntax taggers
- Beyond lexica, char n-grams: higher-level, syntactic context
- Taggers: gender, number, case, ...
Models for Layout
- Specific to language, country
- Script models (not language), e.g. unbounded glyph sets
Mixing Models
- Interactions
- Difficult, fresh topic
Methodology and Theory
- Statistical methods for evaluating models
- Statistical methods for inferring models
Sharing and Collecting Models
- Resource list: contacts, experts in language and script
- Sharing, e.g., speech community
- Features
- Tools for training models; resource list
- Document structure models: physical and logical layout rules specific to language and culture
- Email lists; newsletters
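
The character n-gram models listed under Language Models above can be illustrated with a minimal sketch. It is not a method described by the group: the boundary markers `^` and `$`, the add-alpha smoothing, the function names, and the tiny word list are all assumptions chosen for illustration. The idea is that a bigram model trained on known words can score competing OCR hypotheses for the same word image.

```python
from collections import Counter
import math

def char_ngrams(text, n):
    """Return the list of overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_bigram_model(corpus):
    """Count character bigrams and their one-character left contexts."""
    bigrams = Counter()
    contexts = Counter()
    for word in corpus:
        padded = "^" + word + "$"  # hypothetical word-boundary markers
        for gram in char_ngrams(padded, 2):
            bigrams[gram] += 1
            contexts[gram[0]] += 1
    return bigrams, contexts

def log_prob(word, bigrams, contexts, alphabet_size, alpha=1.0):
    """Add-alpha smoothed log-probability of word under the bigram model."""
    padded = "^" + word + "$"
    total = 0.0
    for gram in char_ngrams(padded, 2):
        numerator = bigrams[gram] + alpha
        denominator = contexts[gram[0]] + alpha * alphabet_size
        total += math.log(numerator / denominator)
    return total

# Toy training lexicon (illustrative only, not workshop data).
corpus = ["script", "scribe", "scan", "scanner"]
bigrams, contexts = train_bigram_model(corpus)
alphabet = {ch for word in corpus for ch in "^" + word + "$"}

# Two competing OCR hypotheses for the same word image.
good = log_prob("scan", bigrams, contexts, len(alphabet))
bad = log_prob("sczn", bigrams, contexts, len(alphabet))
```

With this toy lexicon the hypothesis "scan" scores higher than "sczn", because the bigrams "cz" and "zn" never occur in the training words. A practical model would be trained on a much larger corpus for each language and script.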

RECOMMENDATIONS:

Two papers we would like to read (i.e., we would like to see written): "Multilingual Layout Analysis" and "Culture-specific Layout Rules"
Share language models among academic participants
Investigate ways in which knowledge from the domain of calligraphy can inform typeface modeling for OCR
Create a resource list of contacts who are experts in language and script
Use the speech recognition community as an example of how effective sharing of models and tools can be accomplished
Consider putting together an email list and producing a newsletter, if interest warrants

TOOLS, SYSTEMS, INFORMATION RETRIEVAL

MODERATOR: Venu Govindaraju, SUNY at Buffalo, USA
SCRIBE: Atul Negi, University of Hyderabad, INDIA
PARTICIPANTS:

Steve Dennis, U.S. Department of Defense, USA
Andy Downton, University of Essex, UK
Rinki Gupta, Tata Infotech, INDIA
Yang He, ScanSoft, Inc., USA
Fumitaka Kimura, Mie University, JAPAN
Tomohiko Morioka, ETL, JAPAN
Rohini Srihari, SUNY at Buffalo, USA

SUMMARY/RECOMMENDATIONS:

Data generation:
- Issues and challenges:
  - Use of Microsoft Global IME for typesetting East Asian languages. See: http://www.microsoft.com/Windows/ie/Features/ime.asp
  - Standardization of punctuation and diacritical marks
  - Phonetic mapping
  - Transliteration of multiple words, characters and ideographs
Environments and evaluation tools:
- Issues and challenges:
  - Classes of users
  - Desiderata of tools
  - Design architectural network structure
- Recommendations:
  - Keep broad view compatible with pattern recognition field
  - Fix language pairs
  - Distributed computation architecture
Information retrieval:
- Issues and challenges:
  - Tagging -- levels of entities
  - Language models
  - Building tools and environments
  - MUC style truthing standards
  - Quantity of data
- Recommendations:
  - Fix domains: e.g. newspapers, memos, etc.
  - Newspaper archives
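
One concrete side of the diacritical-mark standardization issue raised under Data generation above is that the same accented character can be encoded in more than one way. The sketch below assumes Unicode text and uses Unicode normalization form NFC as one possible convention; the group's notes do not prescribe this choice, and the function name and sample strings are illustrative.

```python
import unicodedata

def normalize_text(text):
    """Map text to NFC so equivalent diacritic encodings compare equal."""
    return unicodedata.normalize("NFC", text)

precomposed = "caf\u00e9"   # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The raw strings differ even though they render identically.
raw_equal = precomposed == decomposed
# After normalization they compare equal.
normalized_equal = normalize_text(precomposed) == normalize_text(decomposed)
```

Applying one normalization form to both the groundtruth and the OCR output before comparison keeps encoding differences from being counted as recognition errors.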

CORPUS/DATASETS

MODERATOR: Andrew Bagdanov, University of Nevada, Las Vegas, USA
SCRIBE: David Doermann, University of Maryland, USA
PARTICIPANTS:

B. B. Chaudhuri, Indian Statistical Institute, INDIA
Pankaj Doke, Tata Infotech, INDIA
Robert M. Haralick, University of Washington, USA
Tapas Kanungo, University of Maryland, USA
Ihsin Phillips, University of Washington, USA
Larry Spitz, Document Recognition, USA

RECOMMENDATIONS:

Create groundtruthing specifications/guidelines document
- Adopt current standards: XML + Unicode
- Test specifications by converting UW Corpus into the new specification
Hold yearly workshops to crystallize data standards, possibly alongside the ICDAR, SPIE, and DAS conferences
Create international consortium for organizing the creation and distribution of data.
Create basic tools for groundtruth creation according to new standards.
Disseminate data via the web and CDs
Choose business model for creating and maintaining data
- Researchers pay in kind for the data by
  - adding value
  - adding new data
  - providing quality assessment
  - providing tools
- Researchers pay cash for the data
Solicit federal funding (e.g. NSF/DARPA) for:
- workshop for creating dataset standards
- maintaining datasets
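
To make the "XML + Unicode" recommendation above more concrete, here is a minimal sketch of what a groundtruth record could look like. The element and attribute names (`document`, `line`, `image`, `lang`, `x`, `y`, `w`, `h`), the file name, and the sample text are hypothetical; no schema was agreed at the workshop, and this is not the UW Corpus format.

```python
import xml.etree.ElementTree as ET

def make_groundtruth(image_name, language, lines):
    """Build a hypothetical groundtruth record; the schema is illustrative only."""
    root = ET.Element("document", {"image": image_name, "lang": language})
    for index, (bbox, text) in enumerate(lines):
        x, y, w, h = bbox
        line = ET.SubElement(root, "line", {
            "id": str(index),
            "x": str(x), "y": str(y), "w": str(w), "h": str(h),
        })
        line.text = text  # Unicode transcription of the text line
    return root

# Two text lines with bounding boxes (x, y, width, height) in pixels.
record = make_groundtruth(
    "page001.tif", "hi",
    [((10, 20, 300, 40), "नमस्ते"), ((10, 70, 300, 40), "OCR")],
)

# Serialize as UTF-8 bytes, then parse back to confirm the text survives.
xml_bytes = ET.tostring(record, encoding="utf-8")
parsed = ET.fromstring(xml_bytes)
```

Storing transcriptions as Unicode inside XML lets a single format carry any script, which is the point of adopting both standards together.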

MLOCR website: http://www.cfar.umd.edu/~kanungo/workshop/mlocr.html