-------------------------------------------------------------

Robert M. HARALICK

Department of Electrical Engineering

University of Washington, Seattle, WA 98195 USA

haralick@ee.washington.edu Tel. +1-206-685-4974

Evaluating the performance of document image analysis algorithms is not

simple. In the general case where there is a performance requirement

specification, a point estimate of performance is not sufficient. Yet

point estimates of performance seem to predominate in papers discussing

performance or evaluation. In this paper, we sketch out the elements

of doing any kind of document image analysis performance evaluation.

The context in which performance evaluation is done is the controlled

experiment. Controlled experiments provide evidence that the algorithm

segments, recognizes , locates, and/or measures what it is designed to do

from image data. A properly designed scientific experiment provides

evidence to accept or reject the hypothesis that the algorithm performs

to the specified accuracy level, as given in the requirements specification.

To properly set up such an experiment, in a way that it can be repeated

and the evidence verified by another researcher, considerable attention

must be paid to experimental protocol.

The experimental protocol states the quantities to be estimated, how

they will be estimated, the accuracy to which they will be estimated,

and the population of document images on which the document image algorithm

is to be applied, and how the tuning parameters of the algorithm are to

be set or varied. Then the protocol must give the experimental design and

the data analysis plan.

The experimental design describes how a suitably random, independent, and

representative set of images from the specified population is to be sampled,

generated, or acquired. If the population includes, for example, a range of

sizes or orientations of the characters or zones or layouts or if the document

image objects of interest can appear in a variety of contextual situations,

then the sampling mechanism must assure that a reasonable number of images

are sampled with the document image objects appearing in each of the

different sizes or appearing with sizes and orientations throughout

its permissible range.

In general, the essential variation of the images in the population can

be described by some number, say N variables. If the N variables

X_1,...,X_N have respective range sets R_1,...,R_N, then the sampling

design must assure that images sample the domain

R_1 x R_2 x ... R_N

in a representative way. The number of samples may only be a small fraction

of the number of possibilities. This suggests that the experimental design

may have to make judicious use of a latin square layout. If there are some

free tuning parameters that must be set a priori in the document image

algorithm, then these parameters can as well be part of the N vaiables,

in which case, the statistical analysis will be conditioned on their

values and the experimental results can plot performance as a function

of their values.

The experiments must then be carried out for each image in the sample.

Suppose there are M different measurements Y_1,...,Y_M a document image

algorithm is to make on each image. These variables would be a subset

of the N variables which might describe the images in the population of

interest. As a result of these measurements, which are really statistical

inferences, there will be a difference between the true values y_1,...,y_M

of the measured quantities and the measured values themselves. The accuracy

criterion must state how the comparision between the true values and the

measured values will be evaluated.

Finally, the exerimental data analysis plan must state how the hypothesis

that the algorithm meets the specified requirement will be tested. It

indicates how the observed data (the true values and the corresponding

measured values) will be analyzed. The plan must be supported by

theoretically developed statistical analysis which shows that an experiment

carried out according to the experimental design and analyzed according

to the data analysis plan will produce a statistical test itself having

a given accuracy. That is, since the entire population was only sampled,

the sampling fluctuation will introduce random fluctuation in the test

results. For some fraction of experiments carried out according to the

protocol, the hypothesis to be tested will be accepted but if the algorithm

were to be tried on the complete population of image variations, it would

not meet the specified requirements; and for some fraction of experiments

carried out according to the protocol, the hypotheis to be tested will

be rejected but if the algorithm were to be tried on the complete population

of image variations, it would meet the specified requirements.

The specified size of these errors of false acceptance and missed

acceptance will dictate the number of images to be in the sample. This

relation between sample size and false acceptance rate and missed

acceptance rate of the test for the hypothesis must be determined on the

basis of statistical theory. One would certainly expect that the sample

size would be large enough so that these error rates would be below 20%.

For example, if the error rate of a document image algorithm is to be

less than 1/1000, then in order to be about 85% sure that the performance

meets specification, 10,000 tests will have to be run. If the document

image algorithm performs incorrectly 9 or fewer times, then we can assert

that with 85% probability, the document image algorithm meets specification

(Haralick, 1989).

In summary, any experimental protocol must include the following items.

1. The protocol must state:

- what is to be measured on each experimental trial

- what is to be inferred (estimated) from the measurements

or what hypothesis is to be tested (the requirement)

- what is the theoretically expected behavior of what

is to be estimated

- with what precision (standard deviation) is the inferred

quantity to be estimated or with what misdetect and false

alarm error is the hypothesis to be tested

- over what population of document images and algorithm

tuning parameters are the experiments to be done

2. The protocol must layout an experimental design which describes

- how a suitable random, independent, and representative set

of images from the specified population is to be sampled,

generated, or acquired, including the setting of all parameters

relevant to the population

- what parameters of the algorithm will be varied and

how their values will be varied

- what parameters of the algorithm will be set and what values

they will be set at

- the experiments which must carried out for each image

in the population

- the accuracy criterion which states how the comparison between

the true values and the measured values will be evaluated

3. The protocol must have a data analysis plan which

- states how to statistically compare the behavior

of what is estimated and the theoretically expected behavior

of what is estimated to verify the hypothesis that the

observed behavior is consistent with the theoretical expectations

- states how to test the hypothesis that the algorithm meets the

specified requirement

- indicates how the data (the true values and the corresponding

measured values) will be analyzed

- tells explicitly what performance curves will be generated

4. The data analysis plan

- must be supported by a theoretically developed statistical

analysis; and

- show that an experiment carried out according to the experimental

design and analyzed according the data analysis plan will

produce results having a given accuracy (i.e., the rates of

false alarms and miss detections due to the experimental

sampling variation meet the stated requirement)