Short Course on Optical Character Recognition
October 26, 28, 30, 1998
Tapas Kanungo
Directions to the course
Optical Character Recognition (OCR) algorithms take as input
a scanned image of a paper document and produce as output a symbolic
text document (e.g., ASCII, Word, or HTML). Text produced by OCR
algorithms can be searched and indexed by information retrieval
algorithms.
Although researchers have worked on the problem of OCR for at least
thirty years, there has been renewed interest in OCR
technology in recent years. This is partly due to:
i) the increasing need for efficient information storage and
retrieval,
ii) the increasing need for cross-language information access, and
iii) the dramatic drop in scanner prices.
The purpose of this course is to teach the internals of an
OCR system. Much of the time will be spent on OCR systems
that are based on hidden Markov models (HMMs). The labs will allow you
to experiment with sub-modules of OCR systems. No programming
experience is necessary for the labs. Reading material will be
provided at the course site.
Tentative course outline:
Day 1
Day 2
Day 3
- Hidden Markov Models (contd.):
- Connected words, two level dynamic programming
- HMM training
- OCR algorithm summary
- Topics in OCR:
The following sub-topics in OCR will be discussed briefly:
- Degraded documents
- Logical structure extraction
- Colored/textured background
- Evaluation of segmentation results
- Use of linguistic resources
- Voting OCR
- Tables, maps, line drawings, music
- Language identification, multilingual OCR
- Handwriting recognition: online and offline
- Duplicate documents
- Document categorization and routing
- Sensitive word/document detection and redaction
- OCR from camera/video
- Information retrieval from noisy OCR'd documents
- Retrieval without OCR
LAB: HMM OCR continued.
Last modified October 20, 1998.