Testing and Evaluation of Real-Time Address Recognition Systems
---------------------------------------------------------------

Alfred LAWSON*, Srirangaraj SETLUR** and Venu GOVINDARAJU**

* United States Postal Service, Merrifield, VA 22082 alawson@email.usps.gov
** Center of Excellence for Document Analysis & Recognition, Amherst, NY 14228
{setlur,govind}@cedar.buffalo.edu

[This work was supported by contracts from the United States Postal Service.]

Abstract

This paper describes the intricacies involved in the design of a
method for evaluating improvements in performance of a real-time
address recognition system being used by the United States Postal
Service for processing mail-piece images.

Evaluation of the performance of address recognition systems is normally
carried out by measuring the performance of the system on a representative
sample of images. Designing a valid and comprehensive testing
scenario is a complex task that involves a number of steps.

Sampling the live mail-stream to generate a deck of images for
testing, "truthing" a significant number of those images, and grading and
evaluating the performance of the system are all labor-intensive
processes and hence need to be planned with considerable forethought.

Test Deck Generation

A fundamental goal of the test deck generation process is to ensure that
the images collected for the test deck are representative of the image
population observed in actual field operation.

The primary goal of the evaluation process is to evaluate the
improvements of an address interpretation system over a
base-line version of the same system, with emphasis on handwritten
mail-piece processing.

Two key metrics used in the performance evaluation of an address
interpretation system are the encode rate of the system and its error rate.
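
The two metrics can be sketched as follows. The definitions used here are assumptions, since they are not spelled out above: the encode rate is taken to be the fraction of all images for which the system returns an encode, and the error rate the fraction of encoded, truthed images whose encode differs from the truth. The `Result` record and the exact string comparison are hypothetical simplifications; actual grading would follow the USPS Scoring Rules.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Result:
    """Hypothetical record of one system output."""
    image_id: str
    encode: Optional[str]  # None means the system rejected the image


def encode_rate(results: List[Result]) -> float:
    """Fraction of all images for which the system returned an encode."""
    if not results:
        return 0.0
    encoded = sum(1 for r in results if r.encode is not None)
    return encoded / len(results)


def error_rate(results: List[Result], truth: Dict[str, str]) -> float:
    """Fraction of encoded, truthed images whose encode differs from the truth.

    Only images present in `truth` can be checked, which is why the
    error rate needs a truthed deck while the encode rate does not.
    """
    scored = [r for r in results if r.encode is not None and r.image_id in truth]
    if not scored:
        return 0.0
    wrong = sum(1 for r in scored if r.encode != truth[r.image_id])
    return wrong / len(scored)
```

Under these assumptions the encode rate can be measured on the full deck, whereas the error rate is only defined on the truthed subset.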

To evaluate the error rate, it is necessary to have a deck which is
"truthed" so that the accuracy of each encode can be checked. However,
truthing is an extremely expensive process, so the test deck is
sampled further to generate a smaller deck that will be truthed.

Keeping in mind the twin objectives of measuring the encode rate and
the error rate, a test deck of one million handwritten images was
proposed, consisting of contemporary images from a minimum of
120 of the 251 Remote Computer Reader sites distributed all over the
United States. The mail-stream consists of a mix of handwritten
and machine-print images, and one of the goals of the system is to
separate the handwritten images from the machine-print images and
process them appropriately. A machine-print test deck of at least
250,000 pieces from the same 120 sites would therefore also be dispersed
through the test deck prior to testing.

The handwritten image test deck consists of a collection of two sets of
500,000 images each. The first set of 500,000 images is made up of
25,000 samples collected from each of 20 selected sites. 50,000 of
these images selected randomly will be truthed. The second image set
consists of 500,000 images from 100 different field locations. 50,000
of these images selected randomly will also be truthed.
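
As an illustration of the sub-sampling step, the sketch below draws the truthing sample for the first image set. It assumes that "selected randomly" means a uniform draw without replacement over the whole set; the function name, the seed and the scale-down factor are hypothetical and serve only to keep the example small.

```python
import random

SITES_SET_ONE = 20
IMAGES_PER_SITE_SET_ONE = 25_000
TRUTHED_PER_SET = 50_000

# The first set contains 20 * 25,000 = 500,000 images.
assert SITES_SET_ONE * IMAGES_PER_SITE_SET_ONE == 500_000


def truthing_sample(image_ids, k, seed):
    """Draw k image ids uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(image_ids, k)


# Scaled down by a factor of 1000 so that the example runs instantly:
# 20 sites * 25 images = 500 images, of which 50 are drawn for truthing.
SCALE = 1000
set_one = [
    (site, n)
    for site in range(SITES_SET_ONE)
    for n in range(IMAGES_PER_SITE_SET_ONE // SCALE)
]
sample = truthing_sample(set_one, TRUTHED_PER_SET // SCALE, seed=1)
```

A uniform draw of this kind truths 10% of each set but does not guarantee an equal number of truthed images per site; a per-site (stratified) draw would be an alternative if that were required.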

A single machine print test deck with at least 250,000 images collected
from the same 120 sites as the handwritten test deck including 100,000
truthed images dispersed throughout the deck will be used to evaluate
machine print encoding performance.

Truthing

The error rate of an address interpretation
system is a crucial metric in the performance evaluation of the
system. Incorrect encodes not only increase the cost of the
processing necessary to deliver the mail-piece but also
tarnish the reputation of the postal system in the eyes of the public.
Hence, it is essential that the error rate of the system
be measured accurately and that a very low error-rate threshold
be imposed on the system.

The most effective way of measuring the error rate is to compare the
encodes returned by the system against the encode determined by a
human looking at the mail-piece with access to the relevant postal databases.
This process of recording the correct encode corresponding to a mail-piece
image by humans is termed *truthing*.

Truthing is by its very nature a labor-intensive and expensive
process. Hence, it is important to come up with as small a sub-set of
images as possible that is still representative of the entire
mail-stream. To give an indication of the size of the mail-stream
population, an address interpretation
system has to process an average of 1,000,000 pieces of mail
per day at over 250 mail processing centers around the country.

Hence, the number of images that need to be sampled for truthing is
quite large.
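
One way to see why a large truthed sample is needed when the error rate is low is to look at the precision of the estimate. The sketch below uses the standard normal approximation to a binomial proportion. The 1% error rate is a purely hypothetical value chosen for illustration, and treating all truthed images as scored is a simplification, since only encoded images contribute to the error rate.

```python
import math


def half_width_95(error_rate: float, n_truthed: int) -> float:
    """Half-width of an approximate 95% confidence interval for an
    error rate estimated from n_truthed independent images
    (normal approximation to the binomial)."""
    standard_error = math.sqrt(error_rate * (1.0 - error_rate) / n_truthed)
    return 1.96 * standard_error


# Hypothetical 1% error rate, measured on 50,000 and on 500 truthed images.
wide_sample = half_width_95(0.01, 50_000)
small_sample = half_width_95(0.01, 500)
```

With 50,000 truthed images the interval is roughly plus or minus 0.09 percentage points around a 1% error rate; with only 500 images it is ten times wider, which is too coarse to detect a small improvement over a base-line system.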

The paper will address in detail the US Postal Address Encoding Scheme
and the issues involved in truthing mail-stream images.

Truthing Tools

CEDAR uses tools that allow entire sets of images to be truthed in a
distributed environment. A mapping of hot function keys to special
keys on the keyboard ensures fast keying.

Image Scoring
We have developed tools to assist in the scoring of the images. We
currently adopt a two-step process. First, ASCII
representations of the address blocks are transcribed using a GUI editor
and passed through an ASCII address matching
engine that conforms to the Scoring Rules provided by the USPS. The engine
attempts to encode each address block and generate a truth-line in
the format required by the USPS. Second, the ASCII address blocks rejected
by the matching engine are passed to a second
truthing station, where manual truthers encode them using advanced tools.

Image Resolution
We have developed a tool that allows images to be "truthed" by two
different truthers and permits differences to be resolved by a third
truther.
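
The resolution logic can be sketched as below. This is a minimal illustration rather than a description of the actual tool: the function names are hypothetical, the truth for each image is reduced to a single string, and the third truther is modelled as a callback that is consulted only when the first two disagree.

```python
from typing import Callable, Dict


def resolve_truth(
    first: Dict[str, str],
    second: Dict[str, str],
    arbitrate: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Combine two independent truthing passes into one final truth.

    Images on which both truthers agree are accepted as-is; the rest
    are referred to a third truther via `arbitrate(image_id, a, b)`.
    Only images truthed in both passes are considered.
    """
    final = {}
    for image_id in sorted(first.keys() & second.keys()):
        a, b = first[image_id], second[image_id]
        final[image_id] = a if a == b else arbitrate(image_id, a, b)
    return final


# Usage: the two truthers disagree on image "img2" only.
referred = []


def third_truther(image_id: str, a: str, b: str) -> str:
    referred.append(image_id)  # record which images needed resolution
    return b                   # hypothetical decision in favour of b


truth = resolve_truth(
    {"img1": "14228", "img2": "22082"},
    {"img1": "14228", "img2": "22032"},
    third_truther,
)
```

Keeping the arbitration step separate means the third truther sees only the disputed images, which is where most of the saving in effort comes from.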

Vendor Review and Resolution
We are also developing an interface that takes as input a result file
(in a specified format) from a vendor system run and allows for review
and resolution of the vendor results against the official truth.

Final Resolution
We will also extend the capability of vendor resolution developed
above to view the official truth as well as the results produced by
multiple vendors simultaneously.

Vendor Verification
We have already developed tools to grade results supplied by the
vendor against the truth supplied in a specified format.

Image Database System (IDS)
We propose to develop a database system with functions to generate
test decks from a given set of images based on user specified parameters.

Database Look-up Engine
We use a customized USPS database for database look-ups. This also
ensures that the database being used by vendor systems matches the
database being used for truthing.

HWAI System Aided Truthing
An additional time-saving innovation has been to use the Handwritten
Address Interpretation (HWAI) System itself to generate results. A
truther then verifies each encode and corrects it if necessary, thereby
reducing the time required to truth the mail-piece.

Testing and Evaluation

The testing and evaluation model for an address interpretation system
also has to take into account some problems that are unique to
the postal mail-stream scenario.

1. Postal addresses are not *static*, for the following reasons:

a) New addresses are added to the database.
b) ZIP codes get translated into new ZIP codes and addresses move
between ZIP codes.
c) PO Box numbers get cancelled or re-assigned.
d) New firms are added to the database.

The postal databases are a key component of the system-processing as
well as the truthing process. If the truthing is done using a database
from a time different from the database that is used by the system
being tested, the testing process would be flawed since the truth for the
same mail-piece given one temporal instantiation of the database might
not be the truth given a different temporal instantiation of the
database. Hence, it is very important to ensure that the database
used by the system during the test is the same temporally as the
database used for truthing.

2. Patrons sometimes write ambiguous, incomplete or erroneous addresses
on mail-pieces. Comprehensive encoding rules have to be developed to ensure
that the resultant encode for a given address is uniquely determined.
The test deck has to be truthed in accordance with these rules so
that the system can incorporate the same rules in its encoding scheme
and the evaluation does not suffer as a result of multiple encode
possibilities.

3. Handwritten addresses present another source of confusion. Subjective
interpretation of ambiguous digits and characters makes it difficult
to determine the truth unambiguously.
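
The database consistency requirement in item 1 can be illustrated with a small sketch. Everything in it is invented for illustration: the snapshot names, the address and the ZIP codes are made up, and the database is reduced to a dictionary from address to encode. It shows how a ZIP code translation between two snapshots makes a correct system output look like an error.

```python
# Hypothetical snapshots of the postal database: address -> encode.
# Between snapshot "A" and snapshot "B" the ZIP code has been translated.
SNAPSHOTS = {
    "A": {"12 ELM ST": "00001-0012"},
    "B": {"12 ELM ST": "00002-0012"},
}


def lookup(snapshot: str, address: str):
    """Return the encode for an address under a given database snapshot."""
    return SNAPSHOTS[snapshot].get(address)


def counted_as_error(address: str, truth_snapshot: str, system_snapshot: str) -> bool:
    """True when the system's encode disagrees with the recorded truth.

    The system is assumed to read the address perfectly, so any
    disagreement here is caused by the databases alone.
    """
    return lookup(system_snapshot, address) != lookup(truth_snapshot, address)


mismatched = counted_as_error("12 ELM ST", truth_snapshot="A", system_snapshot="B")
matched = counted_as_error("12 ELM ST", truth_snapshot="A", system_snapshot="A")
```

With mismatched snapshots the mail-piece is scored as an error even though the system behaved correctly, which is why the truthing database and the test database must correspond to the same point in time.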