Benchmarking German and English OCR Systems

Stefan AGNE
(presented by Markus JUNKER)

German Research Center for Artificial Intelligence (DFKI) GmbH
P.O. Box 2080, D-67608 Kaiserslautern, GERMANY Tel. (+49) 631-205-3584 FAX (+49) 631-205-3210


In the field of OCR benchmarking, DFKI works in two research areas:
- text-based evaluation of document page segmentation systems and
- character- and word-based evaluation of text recognition systems.


The decomposition of a document into segments such as text regions and
graphics is a significant step in the document analysis process. A
basic requirement for rating and improving page segmentation
algorithms is systematic evaluation. The approaches known from the
literature share the disadvantage that manually generated reference
data (zoning ground truth) is needed for the evaluation task. The
effort and cost of creating appropriate ground truth are high.

At the DFKI, the evaluation system SEE has been developed. As input,
the system requires the OCR generated text and the original text of
the document in correct reading order (text ground truth). The
implicit structure information contained in the text ground truth is
used to evaluate the automatic zoning. To this end, a mapping (the
matches) from text regions in the text ground truth to the
corresponding regions in the OCR generated text is computed. A
fault-tolerant string matching algorithm is used so that the method
tolerates OCR errors in the text.
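
The text does not say which fault-tolerant matching algorithm SEE
uses. As a minimal sketch of the general idea, the following assumes
a standard dynamic-programming approximate substring search. The
function name best_approximate_match and the sample strings are
illustrative only and are not taken from SEE.

```python
def best_approximate_match(pattern, text):
    """Find the substring of `text` closest to `pattern`.

    Returns (distance, end), where `end` is the index just past the
    best-matching substring and `distance` is its edit distance.
    """
    # prev[j]: fewest edits needed to turn the pattern prefix seen so far
    # into some substring of `text` that ends at position j.
    prev = [0] * (len(text) + 1)  # the empty prefix matches anywhere for free
    for i, pattern_char in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, text_char in enumerate(text, start=1):
            cost = 0 if pattern_char == text_char else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # pattern character missing in text
                          curr[j - 1] + 1)     # extra character in text
        prev = curr
    end = min(range(len(text) + 1), key=prev.__getitem__)
    return prev[end], end


# A ground truth region is located in OCR text containing one OCR error.
ocr_text = "page segmentatlon errors"
distance, end = best_approximate_match("segmentation", ocr_text)
print(distance, ocr_text[end - len("segmentation"):end])
```

In this example the region is found with a distance of 1, although
the OCR text contains "segmentatlon" instead of "segmentation".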

The segmentation errors that occur are determined by evaluating the
matching. Subsequently, the edit operations (insertion, substitution,
and deletion of characters) that are necessary to correct the
recognized segmentation errors are computed to estimate the
correction costs. First tests have revealed promising results
regarding the quality of the evaluation.


To evaluate text recognition systems we compare the generated text
(OCR text) with the correct text (text ground truth) by computing the
edit distance (Levenshtein distance) between both texts. As a result
of this comparison we get the necessary minimum number of edit
operations to correct the OCR text.
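
A minimal sketch of this comparison, assuming unit cost for every edit
operation. The function name edit_operations is hypothetical and is
not the name of the DFKI tool.

```python
def edit_operations(ocr, truth):
    """Return (distance, counts) for correcting `ocr` into `truth`."""
    n, m = len(ocr), len(truth)
    # dist[i][j]: edit distance between ocr[:i] and truth[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)

    # Trace one optimal path back to count each kind of operation.
    counts = {"insertion": 0, "substitution": 0, "deletion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ocr[i - 1] == truth[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            if not same:
                counts["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1   # surplus OCR character must be deleted
            i -= 1
        else:
            counts["insertion"] += 1  # missing character must be inserted
            j -= 1
    return dist[n][m], counts


print(edit_operations("Bechmark", "Benchmark"))
```

When several optimal paths exist, the split into insertions,
substitutions, and deletions depends on which path the traceback
follows; the total number of operations does not.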

Based on these techniques we have developed a tool to compute the
following character based evaluation measures:
- character accuracy
- number of errors (insertions, substitutions, and deletions)
- accuracy by character class

Furthermore, we have developed a tool to compute several word-based
evaluation measures:
- word accuracy
- number of misrecognized words
- stopword accuracy and non-stopword accuracy
- distinct non-stopword accuracy
- phrase accuracy
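
As a rough illustration of the first few word-based measures, the
following sketch treats a ground truth word as recognized if it is
matched when aligning the two word sequences. The alignment via
Python's difflib, the tokenization, and the tiny stopword list are
all assumptions made for this example; they are simplifications and
not the exact definitions used by the tool.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "of", "and", "a", "in"}


def words(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def word_accuracies(ocr_text, truth_text, stopwords=STOPWORDS):
    truth = words(truth_text)
    ocr = words(ocr_text)
    # Mark every ground truth word that is aligned to an identical OCR word.
    recognized = [False] * len(truth)
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.a, block.a + block.size):
            recognized[k] = True

    def accuracy(select):
        flags = [ok for word, ok in zip(truth, recognized) if select(word)]
        return sum(flags) / len(flags) if flags else None

    return {
        "word": accuracy(lambda w: True),
        "stopword": accuracy(lambda w: w in stopwords),
        "non_stopword": accuracy(lambda w: w not in stopwords),
        "misrecognized": recognized.count(False),
    }


print(word_accuracies("the evaluatlon of the page segmentation",
                      "the evaluation of the page segmentation"))
```

Here one of six words is misrecognized; all three stopwords and two
of the three non-stopwords are counted as correct.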

The exact definitions of the listed measures are given, for example,
in the description of the fifth and last annual test of OCR accuracy
at ISRI in 1996.

Currently, commercial recognizers can provide information about the
confidence of a recognized word (e.g. in the Xerox ScanWorX XDOC
format), but they rarely provide character alternatives. Exceptions
include Recore from NewSoft, Inc., USA, which outputs alternatives,
and EasyReader from Mimetics, France, which provides up to three
alternatives.

The structure used to represent alternatives at a single character
position is called a character hypothesis lattice (CHL). An example
of a CHL is shown in the following:

---------- Example for character hypothesis lattices (CHL) -----------

(c 999)
(0 456) (o 198)
(m 517)
(q 500) (p 500)
(n 334) (u 333) (h 247)
(t 818)
(e 1000)
(r 734)
--------------------------- End of example ---------------------------

Each position in the CHL lists the candidate characters for that
position, each augmented with an evidence measure. In general, an OCR
engine takes the candidate with the maximum evidence and presents it
as the recognized character (assuming no contextual postprocessing).
A common observation is that the real character, when not recognized
as first choice, is likely to be recognized as the second or third
best alternative. However, under certain circumstances a single
character is recognized as several characters or vice versa. This is
due to an "incorrect" character segmentation on noisy data.

With regard to CHLs, we have extended the functionality of our
evaluation tools. We now compute the character accuracy for a given
depth, where the depth determines the number of character
alternatives taken into account. For example, if we choose a depth of
3, a character counts as correctly recognized if one of the first
three alternatives is equal to the ground truth.
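
The depth-based measure can be sketched on the CHL example given
earlier. The ground truth "computer" is an assumption made for this
illustration (it is not stated in the example), as are the function
names and the treatment of the listed text as the input format. The
sketch also assumes exactly one lattice position per ground truth
character, so the character segmentation errors mentioned above are
not handled.

```python
import re

# The CHL example from the text, one character position per line.
CHL_TEXT = """\
(c 999)
(0 456) (o 198)
(m 517)
(q 500) (p 500)
(n 334) (u 333) (h 247)
(t 818)
(e 1000)
(r 734)
"""

TRUTH = "computer"  # assumed ground truth, not given in the source


def parse_chl(text):
    """Parse lines of '(char evidence)' pairs into ranked alternatives."""
    lattice = []
    for line in text.splitlines():
        pairs = re.findall(r"\((\S) (\d+)\)", line)
        if pairs:
            # Stable sort: candidates with equal evidence keep their order.
            ranked = sorted(pairs, key=lambda pair: -int(pair[1]))
            lattice.append([char for char, _ in ranked])
    return lattice


def first_choice(lattice):
    """Text an engine would output without contextual postprocessing."""
    return "".join(alternatives[0] for alternatives in lattice)


def accuracy_at_depth(lattice, truth, depth):
    """Percentage of characters found among the first `depth` alternatives."""
    if len(lattice) != len(truth):
        raise ValueError("sketch assumes one lattice position per character")
    errors = sum(1 for alternatives, char in zip(lattice, truth)
                 if char not in alternatives[:depth])
    return 100.0 * (len(truth) - errors) / len(truth)


lattice = parse_chl(CHL_TEXT)
print(first_choice(lattice))
for depth in (1, 2, 3):
    print(depth, accuracy_at_depth(lattice, TRUTH, depth))
```

Under these assumptions the first choice reads "c0mqnter", giving
62.5 percent at depth 1, while every ground truth character is found
among the first two alternatives, giving 100 percent at depth 2.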

In the following, the first part of the output of our character based
evaluation tool is shown.

------------------- Output of the evaluation tool --------------------

DFKI Votes Accuracy Tables Version 1.0
917 Characters total in Ground Truth
1.16 Votes-alternatives per Character (for 801 Non-space characters)

Total Accuracy Table:
| Depth | 1 | 2 | 3 | 4 | >4 |
| Errors | 44 | 32 | 31 | 30 | 29 |
| Accuracy | 95.20 | 96.51 | 96.62 | 96.73 | 96.84 |
--------------------------- End of example ---------------------------
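
The accuracy row of this table appears to be consistent with
computing (total - errors) / total over all 917 ground truth
characters. The short check below reproduces the row from the error
counts; the formula is inferred from the numbers shown and is not
stated by the tool itself.

```python
TOTAL_CHARACTERS = 917  # "Characters total in Ground Truth"
ERRORS_BY_DEPTH = {"1": 44, "2": 32, "3": 31, "4": 30, ">4": 29}


def accuracy_percent(total, errors):
    """Character accuracy in percent, relative to the ground truth size."""
    return 100.0 * (total - errors) / total


for depth, errors in ERRORS_BY_DEPTH.items():
    print(f"{depth:>2}  {errors:3d}  "
          f"{accuracy_percent(TOTAL_CHARACTERS, errors):.2f}")
```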

For the evaluation of multilingual OCR, we can think of a series of
further extensions. For example, we can extend the measure "accuracy
by character class" with further language-specific character classes,
such as a German character class containing the German umlauts
"äöüÄÖÜ". Similar extensions are possible for the word-based
evaluation by using language-specific stopword lists. The suggested
extensions are easy to realize.
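
A language-specific character class could be added along the
following lines. This is only a sketch: the class contents (the
umlauts in lower and upper case, plus a separate class for the sharp
s), the function name, and the use of Python's difflib as a stand-in
for an edit-distance alignment are all assumptions made here.

```python
from difflib import SequenceMatcher

# Assumed class definitions for illustration.
CHARACTER_CLASSES = {
    "german_umlaut": set("äöüÄÖÜ"),
    "sharp_s": {"ß"},
}


def accuracy_by_class(ocr, truth, classes=CHARACTER_CLASSES):
    """Fraction of ground truth characters per class that were recognized."""
    # Mark ground truth characters aligned to an identical OCR character.
    correct = [False] * len(truth)
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.a, block.a + block.size):
            correct[k] = True

    result = {}
    for name, members in classes.items():
        flags = [ok for char, ok in zip(truth, correct) if char in members]
        result[name] = sum(flags) / len(flags) if flags else None
    return result


print(accuracy_by_class("Prufgröße für Bucher", "Prüfgröße für Bücher"))
```

In this example two of the four umlauts are misrecognized as "u",
while the sharp s is recognized correctly.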


Primarily, the DFKI deals with German and English documents. Usually
we have one language per document.


For our tests we use several datasets:
- English Business Letter Sample (ISRI)
- German Business Letter Sample (ISRI)
- several internal datasets:
  + German Facsimile Sample
  + German Magazine Sample
  + Further German Business Letter Samples