NOTES FROM MLOCR WORKING GROUPS
Workshop participants divided themselves into three groups: i) Models,
ii) Tools and Information Retrieval, and iii) Corpus and Datasets.
They discussed various issues related to multilingual OCR and
multilingual OCR evaluation. Notes and recommendations from each group
are summarized below.
- MODELS FOR X: Henry Baird, Moderator
- TOOLS, SYSTEMS, INFORMATION RETRIEVAL: Venu Govindaraju, Moderator
- DATASETS/CORPUS: Andrew Bagdanov, Moderator
MODERATOR: Henry Baird, Xerox PARC, USA
SCRIBE: Kris Popat, Xerox PARC, USA
PARTICIPANTS:
- Mindy Bokser, Caere Corporation, CA, USA
- Ken Cantwell, U.S. Department of Defense, USA
- Tin Ho, Bell Labs, NJ, USA
- R. K. Joshi, NCST, Bombay, INDIA
- Markus Junker, DFKI, GERMANY
- Sami Khoury, Department of National Defense, CANADA
- Koichi Kise, Osaka Prefecture University, Osaka, JAPAN
- S. P. Mudur, NCST, Bombay, INDIA
- Vidyesh Nabar, Tata Infotech, Bombay, INDIA
- Vadim Tereshchenko, ABBYY, Moscow, RUSSIA
- Ben Wittner, ScanSoft, MA, USA
TOPICS DISCUSSED:
- Models of:
  - Script
  - Typefaces
  - Language
  - Typography and layout
  - Image degradation
  - Genre-specific layout
  - Weak contexts; e.g., dates, culture
- Types of Models:
  - Probabilistic
    * parametric
    * non-parametric
  - Deterministic
  - Generative models
  - Declarative (e.g., punctuation, regular expressions)
    vs.
    Procedural (e.g., ad hoc, local)
  - Assumptions vs. models
  - Implicit, incomplete models
- Language Models
  - Lexica
  - n-grams
  - Syntax taggers
  - Beyond lexica, char n-grams: higher-level, syntactic context
  - Taggers: gender, number, case, ...
- Models for Layout
  - Specific to language, country
- Script models (not language), e.g., unbounded glyph sets
- Mixing Models
  - Interactions
  - Difficult, fresh topic
- Methodology and Theory
  - Statistical methods for evaluating models
  - Statistical methods for inferring models
- Sharing and Collecting Models
  - Resource list: contacts, experts in language and script
  - Sharing, e.g., speech community
  - Features
  - Tools for training models; resource list
  - Document structure models: physical and logical layout rules
    specific to language and culture
  - Email lists; newsletters
RECOMMENDATIONS:
- Two papers we would like to read (i.e., we would like to see written):
"Multilingual Layout Analysis"
"Culture-specific Layout Rules"
- Share language models among academic participants
- Investigate ways in which knowledge from the domain
of calligraphy can inform typeface modeling for OCR
- Create a resource list of contacts who are
experts in language and script
- Use the speech recognition community as an example of
how effective sharing of models and tools can be accomplished
- Consider putting together an email list and producing
a newsletter, if interest warrants
MODERATOR: Venu Govindaraju, SUNY at Buffalo, USA
SCRIBE: Atul Negi, University of Hyderabad, INDIA
PARTICIPANTS:
- Steve Dennis, U.S. Department of Defense, USA
- Andy Downton, University of Essex, UK
- Rinki Gupta, Tata Infotech, INDIA
- Yang He, ScanSoft, Inc., USA
- Fumitaka Kimura, Mie University, JAPAN
- Tomohiko Morioka, ETL, JAPAN
- Rohini Srihari, SUNY at Buffalo, USA
SUMMARY/RECOMMENDATIONS:
- Data generation:
  - Issues and challenges:
    - Use of Microsoft Global IME for typesetting East Asian languages.
      See: http://www.microsoft.com/Windows/ie/Features/ime.asp
    - Standardization of punctuation and diacritical marks
    - Phonetic mapping
    - Transliteration of multiple words, characters and ideographs
- Environments and evaluation tools:
  - Issues and challenges:
    - Classes of users
    - Desiderata of tools
    - Design architectural network structure
  - Recommendations:
    - Keep a broad view, compatible with the pattern recognition field
    - Fix language pairs
    - Distributed computation architecture
- Information retrieval:
  - Issues and challenges:
    - Tagging: levels of entities
    - Language models
    - Building tools and environments
    - MUC-style truthing standards
    - Quantity of data
  - Recommendations:
    - Fix domains, e.g., newspapers and memos
    - Newspaper archives
MODERATOR: Andrew Bagdanov, University of Nevada, Las Vegas, USA
SCRIBE: David Doermann, University of Maryland, USA
PARTICIPANTS:
- B. B. Chaudhuri, Indian Statistical Institute, INDIA
- Pankaj Doke, Tata Infotech, INDIA
- Robert M. Haralick, University of Washington, USA
- Tapas Kanungo, University of Maryland, USA
- Ihsin Phillips, University of Washington, USA
- Larry Spitz, Document Recognition, USA
RECOMMENDATIONS:
- Create groundtruthing specifications/guidelines document
- Adopt current standards: XML + Unicode
- Test specifications by converting UW Corpus into the new
specification
- Yearly workshops for crystallizing data standards: Possibly along with
ICDAR, SPIE, and DAS conferences
- Create international consortium for organizing the creation and distribution of data
- Create basic tools for groundtruth creation according to new standards
- Disseminate data via the web, CDs
- Choose business model for creating and maintaining data
  - Researchers pay in kind for the data by
    - adding value
    - adding new data
    - providing quality assessment
    - providing tools
  - Researchers pay cash for the data
- Solicit federal funding (e.g., NSF/DARPA) for:
  - workshop for creating dataset standards
  - maintaining datasets
MLOCR website: http://www.cfar.umd.edu/~kanungo/workshop/mlocr.html