OCR Readiness for Army Tasks: Needs for an Evaluative Framework
      ---------------------------------------------------------------
                    Melissa HOLLAND and Jeff DeHART		
               U.S. Army Research Lab (ARL), AMSRL-IS-CI
            2800 Powder Mill Rd., Adelphi, MD 20783-1145 USA
       {mholland,jdehart}@arl.mil  Tel. 301-394-3001/2263 FAX -2682
The Need: Multilingual OCR for Low-Quality Documents
   The U.S. Army's need for multilingual OCR centers on captured documents, 
papers of coalition partners, and open-source materials encountered in 
missions to non-English-speaking countries. Often, these documents are 
produced through the substandard printing processes of developing or 
conflict-torn regions. Personal papers may be folded, torn, smudged, or 
otherwise compromised in quality. In addition, with regional instabilities 
growing and natural disasters likely to occur anywhere, the Army needs OCR 
in a wide range of languages--only some of which have commercial value. 
For example, recent missions to the Persian Gulf, Bosnia, Haiti, and 
hurricane-stricken Honduras have demanded a capability to process and 
translate documents in Arabic, Farsi, Serbian, Croatian, Haitian Creole, 
and Spanish.
Focusing the Need: FALCON
   The stimulus for ARL's interest in OCR is a portable scanner-translator 
prototype we have developed called FALCON (Forward Area Language Converter) 
(Holland & Schlesiger, 1998). FALCON has been tried by selected military 
intelligence and special operations units in Bosnia, Haiti, and Panama. In 
a common usage scenario, troops in a "forward area" (at the front of a 
conflict or peacekeeping mission) find a document written in, say, Haitian 
Creole, then scan it, perform OCR, and submit the resulting text to machine 
translation to get a rough English rendition. Troops who don't speak the 
language can then tell whether the document has intelligence value and 
should be sent to rear echelons for finer examination by linguists and 
analysts. In this scenario, FALCON serves to filter incoming documents. 
Findings from the field show that the OCR output from FALCON, obtained from 
commercial packages, is often so inaccurate as to be useless for machine 
translation and subsequent filtering. These findings have focused ARL's 
attention on the OCR bottleneck and on how to evaluate OCR systems for 
readiness to deal with low-quality documents in an array of languages.
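   For concreteness, the filtering scenario above can be sketched as a 
simple pipeline. The sketch below is hypothetical: the stub functions stand 
in for the scanner, the commercial OCR package, and the MT engine (FALCON's 
actual interfaces are not described here), and the keyword test stands in 
for the user's relevance judgment.

    # Hypothetical sketch of a FALCON-style filtering pipeline.
    # scan_page, run_ocr, and run_mt are placeholder stubs, not FALCON APIs.

    def scan_page(path: str) -> bytes:
        """Stub: return raw image bytes for one scanned document page."""
        with open(path, "rb") as f:
            return f.read()

    def run_ocr(image: bytes, language: str) -> str:
        """Stub: hand the page image to an OCR engine for the language."""
        raise NotImplementedError("plug in an OCR package here")

    def run_mt(text: str, source_language: str) -> str:
        """Stub: produce a rough English rendition by machine translation."""
        raise NotImplementedError("plug in an MT engine here")

    def worth_forwarding(path: str, language: str, keywords: set) -> bool:
        """Scan, recognize, and translate a document, then apply a crude
        relevance test: does any keyword of interest appear in the rough
        English rendition? If so, send it to rear-echelon linguists."""
        english = run_mt(run_ocr(scan_page(path), language), language)
        words = {w.strip(".,;:!?").lower() for w in english.split()}
        return bool(words & {k.lower() for k in keywords})

The point of the sketch is the dependency chain: if run_ocr degrades the 
text badly enough, run_mt and the relevance test downstream are starved of 
usable input--exactly the bottleneck observed in the field.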
Implications for Evaluation
   As is clear from ARL's experience with FALCON, the Army needs ways of 
answering questions about OCR readiness:
    -- whether to use OCR in an application
    -- which among competing OCR packages to deploy
    -- what level of performance to expect in a deployment.
Moreover, these questions must be answered in the context of documents that 
range widely in quality and that occur in multiple languages. The Army's 
needs for evaluation intersect with topics of this workshop as follows:
    -- We require access to ground-truthed data sets in many languages, 
       including low-diffusion languages like Farsi and Haitian Creole.
    -- We need data sets that either represent, or can be adjusted to 
       represent, the classes of degradation typical of the field.
    -- We need a translingual framework for interpreting OCR performance 
       data in order to support decisions about the readiness of OCR in 
       different languages; that is, we need to relate abstract performance 
       metrics to performance of the task.
   The last topic bears elaboration. Because OCR in FALCON is embedded in a 
larger process that includes machine translation (Voss & Reeder, 1998), we 
need to know the impact of OCR error rates on (a) the output of translation 
(itself hard to assess) and (b) the uses of translation (e.g., making 
decisions about document relevance). By relating traditional OCR 
performance metrics to task suitability, we may be able to set 
interpretable benchmarks by which to compare OCR in different languages. 
Alternatively, comparisons using traditional error measures may mislead 
because of variations in morphological and orthographic makeup across 
languages: the same character error rate can translate into very different 
word error rates, and different losses of meaning, depending on typical 
word length and on how much grammatical information each word form carries. 
A hierarchy of tasks we are using to compare MT systems is potentially 
useful for OCR systems as well. Developed by Taylor & White (1998), it 
ranks tasks from less to more demanding of MT, for example, filtering on 
the low end and text summarization on the high end.
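   To make the metric side concrete: the character error rate commonly 
reported for OCR is the edit distance between system output and ground 
truth, normalized by the length of the ground truth. A minimal sketch 
follows; the metric itself is standard, but the function names and the 
example strings are ours, for illustration only.

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance: the minimum number of character
        insertions, deletions, and substitutions turning a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def character_error_rate(ocr_output: str, ground_truth: str) -> float:
        """Edit distance normalized by ground-truth length."""
        return edit_distance(ocr_output, ground_truth) / max(len(ground_truth), 1)

    # One substituted character in a 12-character reference: rate = 1/12.
    print(character_error_rate("f0rward area", "forward area"))  # 0.0833...

Word error rate is computed analogously over word tokens; both are the 
traditional error measures whose cross-language interpretation is at issue 
above.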
REFERENCES
Holland, M., & Schlesiger, C. (1998). High-mobility machine translation for 
a battlefield environment. The application of information technologies to 
mission systems: Proceedings of NATO/RTO Systems Concepts and Integration 
Symposium, Monterey, CA, 15/1-3. Hull, Canada: CCG, Inc. (ISBN 
92-837-1006-1).
Taylor, K., & White, J. (1998). Predicting what MT is good for: User 
judgments and task performance. In D. Farwell et al. (Eds.), Machine 
translation and the information soup: Proceedings of the Association for 
Machine Translation in the Americas Annual Meeting, Langhorne, PA, 364-373. 
New York: Springer.
Voss, C., & Reeder, F. (Eds.) (1998). Proceedings of the workshop on 
embedded machine translation: Design, construction, and evaluation of 
systems with an MT component. (In conjunction with the Association for 
Machine Translation in the Americas Annual Meeting, Langhorne, PA). 
Adelphi, MD: Army Research Lab.