Distributed Evaluation For Multi-lingual OCR Algorithms:
--------------------------------------------------------
Problem Analysis and User Desiderata
------------------------------------

S. M. LUCAS and A. C. DOWNTON

Department of Electronic Systems Engineering
University of Essex, Colchester, Essex CO4 3SQ, UK
{sml,acd}@essex.ac.uk Tel. +44 1206 872935/872910 Fax: +44 1206 872910

1. INTRODUCTION

The need for objective automatic systems for evaluating the quality of
software algorithms stems from a simple observation: that it is much harder
than it should be to evaluate and scientifically choose the best of a
number of alternative software components or algorithms for a given task.
Currently, if one is buying a new car or any other consumer product, there
are a wealth of objective (and many more subjective) data readily available
that may be used to inform the decision process. For many classes of
algorithm, this is not the case; evaluating the best alternative is a
tedious and time-consuming process, with the result that it is generally a
neglected aspect of research and development. Multilingual OCR is only one
domain from within the wide range of pattern analysis and machine
intelligence algorithms for which automatic evaluation is required;
therefore any proposed solution needs to address the more general field as
well as the specific domain if it is to be sufficiently widely adopted to
become an accepted standard.

Our philosophy in addressing this need is to automate as far as is possible
the software evaluation process. There are a vast number of interesting
problems to be posed and solved, an infinite number of possible datasets
for each one, and a number of candidate solutions limited only by the
people willing to write and submit them (in fact, in the spirit of
evolutionary computation, the candidate solutions could even be computer
generated). Hence, the best way to make significant progress is to develop
a generic distributed system which facilitates and encourages interaction
by any networked computer-user with an interest in that problem. Not only
is this desirable from the computation and data-storage perspectives, it is
also important to encourage independent corroboration of results.

Any automated evaluation system should cover the complete evaluation
process, and address the needs of diverse user groups, including those
concerned with problem specification; dataset specification, provision and
distribution; algorithm submission and distribution; and results
generation, analysis and distribution. In this paper, we analyse the needs
of these user groups, and then propose a distributed system architecture
which will be general enough to allow the evaluation of virtually any
software component or algorithm whose objective performance can be measured
empirically in a reasonable amount of CPU time.

2. USERS AND THEIR NEEDS

A number of different classes of potential users of a distributed
evaluation system can be identified:

Problem definer:
The problem definer lays down a precise specification for a problem.
This involves specifying the nature and format of the data and the interfaces
between an evaluation framework and a candidate solution algorithm. The
system should help to formalise and proceduralise this activity, which is
commonly undertaken by a project manager, and would thus help to ensure
that the resulting software solution met specified quality standards.

Algorithm developer:
With an automated evaluation system the developer writes an algorithm that
conforms to the specified interface for a problem, then simply submits the
code (or a URL for the code) to the system for automatic evaluation. It
must be emphasised that the extra effort of conforming to a particular
interface will be minimal, while the time saved by automatic evaluation
will be great.

Evaluation framework developer:
By belonging to a distributed network of nodes with a well defined
infrastructure the framework developer has more chance of his system being
widely used.

Application builder:
By application builder we mean a person (or team) who puts together an
application from a set of components. Currently, knowing which
implementation to choose for each component is an expert task. With an
automated evaluation system, the application builder could easily search a
distributed database of compatible components,
sorting the search results in order of the current set of criteria.

Dataset provider:
The system should automatically train and test a wide range of algorithms
on any submitted dataset, providing that the dataset is in an allowed
format. Hence, the dataset provider gets rapid feedback on how difficult
his data is to deal with. In the case of multilingual OCR he/she can
explore performance tradeoffs which result from building multilingual
systems versus retargetable single language systems.

Report writer/information summariser:
Here the term report writer could mean a researcher writing a technical
paper or a journalist writing a technical article for a general audience.
The current situation is that a thorough evaluation of available algorithms
for a given task is time-consuming and difficult even for a skilled
researcher. The situation for the journalist with a column to produce is
even worse, and the result of this is that, despite the growing public
interest in OCR algorithms, the reader in most cases has to make do with hype.

3. PREVIOUS RELATED WORK

Related work on performance evaluation can be classified under three
general headings: data archives, competitions, and algorithm
test-harnesses, though research has typically not addressed one area
exclusively.

Archives of data for evaluation are widely available: for example the
University of California at Irvine Machine Learning Repository [1] has a
wide range of PAMI databases (123 at present) available for download, and
the Pilot European Image Processing Archive at University of Essex [2]
provides a similar wide range of image databases. In the field of
handwriting recognition, both on-line [3] and off-line databases
are widely used. Although all these databases enable direct comparison of
different algorithm results on the same data, there is no standardisation
of APIs, resulting in unnecessary software overheads to access the databases.

Competitions typically integrate one or more databases of sample data with
an informal specification for how an algorithm should use this data, with
the aim of finding the best algorithmic solution to specific problems.
Examples of this approach include the Santa Fe [4] and 1999 Conference on
Evolutionary Computation [5] Time Series Prediction competitions, and the
Abbadingo Grammatical Inference competition [6]. The informal competition
specification constrains algorithm designers in an attempt to level the
playing field compared with supplying a database alone. In practice
however, the lack of a formal well-engineered software framework and the
focus of competition designers on one or more specific features of their
own research area mean that these competitions are idiosyncratic rather
than generalisable, and still favour particular classes of competitor (for
example those with large computational resources for off-line algorithm
training).

Finally, algorithm test-harnesses such as Delve [7] and HATE [8] go one
step further than competitions in providing not just a dataset and problem
specification, but also an evaluation framework which tries to ensure
objectivity. This approach is of obvious benefit to the algorithm
developer, dataset provider and framework developer, but still has several
restrictions compared with the proposed distributed automated approach.

4. DESIDERATA

According to Meyer [9] the software industry is going through a
revolutionary change in its interest in and acceptance of component-based
software technology - the biggest
revolution since the acceptance of object technology about ten years ago.
Meyer also observes the need for tight quality control in component
development. A distributed automated component evaluation system would
provide a much-needed mechanism for
achieving this kind of quality control within the domain of algorithm
evaluation and testing for multi-lingual OCR. Until recently, a system of
this sort would have been practically impossible. With the growth in the
Internet, and the power of Java
and HTML/XML such systems can now be constructed rapidly and efficiently.

The specification for a distributed evaluation system can be established by
considering the desiderata which exist for its performance.

Ease of use:
The system must be easy to use. This means different things for different
categories of user. Some categories of user (such as algorithm developer)
can be assumed to be competent programmers. In this case, the aim is to
provide a simple and natural API that does not impose a significant
overhead on the development of their algorithm. The report writer user
group on the other hand must be presented with simple menu-driven
interfaces that allow them to retrieve the required results summaries. For
many categories of user, the system should not require the installation of
any special software beyond a web-browser and in some cases a web-server.

Autonomy of algorithm:
An important aspect of the evaluation system is that it should evaluate the
quality of an algorithm rather than the researcher who is driving it. This
naturally follows from the nature of the system: an algorithm is submitted
for evaluation, after which it has no contact with its author.

Validity of results:
There are many ways to run tests on algorithms, and also many ways to
statistically analyse the significance of the different test scores
obtained by various algorithms (the IEEE workshop on Empirical Evaluation
Methods in Computer Vision this year is largely devoted to the subject).
Also, it is important that any chosen method is correctly implemented. The
evaluation framework should be open source and well-documented to encourage
this.

Fairness of competition:
By this we mean the concept of a level playing field. Many previous
competitions have concentrated (e.g. Abbadingo, CEC-99 time-series) on the
quality of the solution, with no attention paid to the amount of
computation needed to produce it. A fair system should carefully measure
the computational resources used by each algorithm.

Security of data:
We hope that most people submitting problems or datasets to an automated
system would make them public domain (PD). However, it is also important to
offer a solution for cases where the data are simply too private or
commercially valuable to allow this. To overcome this, evaluation systems
must be able to run at the same site as the data, and move algorithms to
that site to be evaluated.

Security of algorithm:
Again, we hope that most people submitting algorithms or software
components would make them PD, but we should also allow for cases where the
evaluation system communicates remotely with the algorithm, which never
leaves the author's site. This raises quality of results issues - it now
becomes difficult to evaluate the speed of the algorithm (a slow response
could be due to network delay), and also the accuracy of the algorithm
(e.g. an OCR algorithm under test could be aided by a person sitting at a
workstation).

5. ARCHITECTURE

We propose a system which comprises a distributed network of servers and
clients. Server nodes will run a number of installed software modules
depending on the services offered by a particular node. Clients do not
need to install any evaluation system software; they interact with the
system via a Java-enabled web-browser.

Client users (algorithm developers, dataset providers etc.) would use a
web-browser to access the evaluation system through a central portal.
They follow links to the service they require, then make requests to the
system (by now they could be linked to any other evaluation system nod
on the Internet).

Requests could be made of the evaluation system at various level of
abstraction. A high-level request (e.g. test my new OCR algorithm) would be
broken down into a more specific task (e.g. run k-fold cross validation
using this combination of problem interface, algorithm and dataset). The
dataset is then formed into a number of random training/testing partitions
and the algorithm is evaluated on each partition (Problem Instance)
utilising various compatible evaluation nodes, which could be situated
anywhere on the Internet.

6. METHODOLOGY

The difficulty in implementing such a system lies not in the complexity of
programming, but in thoroughly investigating all the design issues
associated with meeting the requirements of the various classes of users
identified above. In fact, high-level Java class libraries such as JDBC and
RMI make it possible to implement required components such as database
servers, web servers and agents with minimal code. Thus the appropriate
system development process involves successive iterative refinement in
which a minimal evaluation system is built, a competition is implemented
and run using the system, the system and competition are evaluated and the
results of the evaluation are used to inform the redesign during the next
iteration. Since conferences run on a yearly cycle, iterations of
refinement to a basic design can be developed annually, although diverse
competitions should be included in each cycle where possible.

Each iteration should aim to integrate more of the features of the proposed
complete system: in the first iteration, the focus could be largely on the
central components of problem and dataset specification, algorithm
submission and objective evaluation. This work could draw significantly
from the existing work on Delve [7] and HATE [8]. In the second iteration
more advanced distributed evaluation options and more flexible submission
procedures would be developed. Finally, the focus would shift to making
the system more appealing to less research-oriented users such as
application builders and report writers, with emphasis on presentation of
results and high-level query/request processing.

7. CONCLUSIONS

In this paper we have analysed the requirements for objective evaluation of
OCR algorithms and proposed a distributed networked evaluation framework
for satisfying these requirements. This framework could be developed
iteratively with feedback being gained by utilising it for on-line
experimental evaluation associated with regular conferences and workshops.
Although the issues concerned with multilingual OCR could be considered in
isolation, the required evaluation framework would have much more chance of
widespread adoption if it addressed the broad needs of the pattern
recognition and machine intelligence community rather than focusing on only
one specific area.

REFERENCES

1. http://www.ics.uci.edu/\verb|~|mlearn/mlrepository.html University of
California at Irvine Machine Learning Repository.
2. http://peipa.essex.ac.uk/ Pilot european image processing archive.
3. http://unipen.nici.kun.nl/ Data archive for on-line handwriting
recognition.
4. http://www.stern.nyu.edu/\verb|~|aweigend/time-series/santafe.html Santa
Fe time series prediction competition.
5. http://www.math.iastate.edu/danwell/ep99/contests.html Congress on
Evolutionary Computation 1999 Time Series Prediction Competition.
6. http://abbadingo.cs.unm.edu/ Grammatical Inference Competition.
7. http://www.cs.utoronto.ca/~delve/ Data for evaluation of learning in
valid experiments, evaluation framework and repository for datasets and
algorithms.
8. http://peipa.essex.ac.uk/hate/ Harness for Algorithm Testing and
Evaluation.
9. B. Meyer, "On to components", IEEE Computer, vol.32, no.1, pp.139-140,
(1999).

Professor A. C. Downton
Department of Electronic Systems Engineering
University of Essex
Wivenhoe Park
Colchester
Essex, CO4 3SQ

Tel: (01206) 872910 Fax: (01206) 872900
Tel international: +44 1206 872910 Fax international: +44 1206 872900

Email: acd@essex.ac.uk
WWW: http://vasawww.essex.ac.uk/~acd/acd.html