OCR Readiness for Army Tasks: Needs for an Evaluative Framework
      ---------------------------------------------------------------
                    Melissa HOLLAND and Jeff DeHART		
               U.S. Army Research Lab (ARL), AMSRL-IS-CI
            2800 Powder Mill Rd., Adelphi, MD 20783-1145 USA
       {mholland,jdehart}@arl.mil  Tel. 301-394-3001/2263 FAX -2682
The Need: Multilingual OCR for Low-Quality Documents
   The U.S. Army's need for multilingual OCR centers on captured documents, 
papers of coalition partners, and open-source materials encountered in 
missions to non-English-speaking countries. Often, these documents are 
produced through the substandard printing processes of developing or 
conflict-torn regions. Personal papers may be folded, torn, smudged, or 
otherwise compromised in quality. In addition, with regional instabilities 
growing and natural disasters likely to occur anywhere, the Army needs OCR 
in a wide range of languages--only some of which have commercial value. 
For example, recent missions to the Persian Gulf, Bosnia, Haiti, and 
hurricane-stricken Honduras have demanded a capability to process and 
translate documents in Arabic, Farsi, Serbian, Croatian, Haitian Creole, 
and Spanish.
Focusing the Need: FALCON
   The stimulus for ARL's interest in OCR is a portable scanner-translator 
prototype we have developed called FALCON (Forward Area Language Converter) 
(Holland & Schlesiger, 1998). FALCON has been tried by selected military 
intelligence and special operations units in Bosnia, Haiti, and Panama. In 
a common usage scenario, troops in a "forward area" (at the front of a 
conflict or peacekeeping mission) find a document written in, say, Haitian 
Creole, then scan it, perform OCR, and submit the resulting text to machine 
translation to get a rough English rendition. Troops who don't speak the 
language can then tell whether the document has intelligence value and 
should be sent to rear echelons for finer examination by linguists and 
analysts. In this scenario, FALCON serves to filter incoming documents. 
Findings from the field show that the OCR output from FALCON, obtained from 
commercial packages, is often so inaccurate as to be useless for machine 
translation and subsequent filtering. These findings have focused ARL's 
attention on the OCR bottleneck and on how to evaluate OCR systems for 
readiness to deal with low-quality documents in an array of languages.
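   For concreteness, the filtering scenario above can be sketched as a 
simple pipeline. The sketch below is hypothetical: the stub functions stand 
in for the scanner, the commercial OCR package, and the MT engine (FALCON's 
actual interfaces are not described here), and the keyword test stands in 
for the user's relevance judgment.

    # Hypothetical sketch of a FALCON-style filtering pipeline.
    # scan_page, run_ocr, and run_mt are placeholder stubs, not FALCON APIs.

    def scan_page(path: str) -> bytes:
        """Stub: return raw image bytes for one scanned document page."""
        with open(path, "rb") as f:
            return f.read()

    def run_ocr(image: bytes, language: str) -> str:
        """Stub: hand the page image to an OCR engine for the language."""
        raise NotImplementedError("plug in an OCR package here")

    def run_mt(text: str, source_language: str) -> str:
        """Stub: produce a rough English rendition by machine translation."""
        raise NotImplementedError("plug in an MT engine here")

    def worth_forwarding(path: str, language: str, keywords: set) -> bool:
        """Scan, recognize, and translate a document, then apply a crude
        relevance test: does any keyword of interest appear in the rough
        English rendition? If so, send it to rear-echelon linguists."""
        english = run_mt(run_ocr(scan_page(path), language), language)
        words = {w.strip(".,;:!?").lower() for w in english.split()}
        return bool(words & {k.lower() for k in keywords})

The point of the sketch is the dependency chain: if run_ocr degrades the 
text badly enough, run_mt and the relevance test downstream are starved of 
usable input--exactly the bottleneck observed in the field.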
Implications for Evaluation
   As is clear from ARL's experience with FALCON, the Army needs ways of 
answering questions about OCR readiness:
    -- whether to use OCR in an application
    -- which among competing OCR packages to deploy
    -- what level of performance to expect in a deployment.
Moreover, these questions must be answered in the context of documents that 
range widely in quality and that occur in multiple languages. The Army's 
needs for evaluation intersect with topics of this workshop as follows:
    -- We require access to ground-truthed data sets in many languages, 
       including low-diffusion languages like Farsi and Haitian Creole.
    -- We need data sets that either represent, or can be adjusted to 
       represent, the classes of degradation typical of the field.
    -- We need a translingual framework for interpreting OCR performance 
       data in order to support decisions about the readiness of OCR in 
       different languages; that is, we need to relate abstract performance 
       metrics to performance of the task.
   The last topic bears elaboration. Because OCR in FALCON is embedded in a 
larger process that includes machine translation (Voss & Reeder, 1998), we 
need to know the impact of OCR error rates on (a) the output of translation 
(itself hard to assess) and (b) the uses of translation (e.g., making 
decisions about document relevance). By relating traditional OCR 
performance metrics to task suitability, we may be able to set 
interpretable benchmarks by which to compare OCR in different languages. 
Alternatively, comparisons using traditional error measures may mislead 
because of variations in morphological and orthographic makeup across 
languages: the same character error rate can translate into very different 
word error rates, and different losses of meaning, depending on typical 
word length and on how much grammatical information each word form carries. 
A hierarchy of tasks we are using to compare MT systems is potentially 
useful for OCR systems as well. Developed by Taylor & White (1998), it 
ranks tasks from less to more demanding of MT, for example, filtering on 
the low end and text summarization on the high end.
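   To make the metric side concrete: the character error rate commonly 
reported for OCR is the edit distance between system output and ground 
truth, normalized by the length of the ground truth. A minimal sketch 
follows; the metric itself is standard, but the function names and the 
example strings are ours, for illustration only.

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance: the minimum number of character
        insertions, deletions, and substitutions turning a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def character_error_rate(ocr_output: str, ground_truth: str) -> float:
        """Edit distance normalized by ground-truth length."""
        return edit_distance(ocr_output, ground_truth) / max(len(ground_truth), 1)

    # One substituted character in a 12-character reference: rate = 1/12.
    print(character_error_rate("f0rward area", "forward area"))  # 0.0833...

Word error rate is computed analogously over word tokens; both are the 
traditional error measures whose cross-language interpretation is at issue 
above.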
REFERENCES
Holland, M., & Schlesiger, C. (1998). High-mobility machine translation for 
a battlefield environment. The application of information technologies to 
mission systems: Proceedings of NATO/RTO Systems Concepts and Integration 
Symposium, Monterey, CA, 15/1-3. Hull, Canada: CCG, Inc. (ISBN 
92-837-1006-1).
Taylor, K., & White, J. (1998). Predicting what MT is good for: User 
judgments and task performance. In D. Farwell et al. (Eds.), Machine 
translation and the information soup: Proceedings of the Association for 
Machine Translation in the Americas Annual Meeting, Langhorne, PA, 364-373. 
New York: Springer.
Voss, C., & Reeder, F. (Eds.) (1998). Proceedings of the workshop on 
embedded machine translation: Design, construction, and evaluation of 
systems with an MT component. (In conjunction with the Association for 
Machine Translation in the Americas Annual Meeting, Langhorne, PA). 
Adelphi, MD: Army Research Lab.