Data Requirements for Inference of Iconic and Linguistic Models

Henry S. BAIRD and Kris POPAT

Xerox PARC, 3333 Coyote Hill Rd, Palo Alto, CA 94304 USA

Our research group is exploring methods for document recognition,
compression, and retrieval that achieve high performance by relying
on explicitly stated, detailed models of aspects of the target
document, whether iconic (known typefaces and image degradations)
or symbolic (linguistic and pragmatic contextual constraints).
Some of these models are probabilistic and
have many free parameters that must be estimated on the basis of
sparse example data. We possess methods [K93] for the semi-automatic
inference of iconic models of machine-printed symbols in any typeface.
It is an open question how much ground-truthed input must be provided
to make this inference usefully converge, especially for scripts with
large symbol sets. We are presently investigating methods for inferring
shallow statistical models of language: specifically, character and
word n-gram Markov models. Here, even for small scripts and low-order
character models, training data is often dauntingly sparse, and smoothing
methods, which may be language-dependent, play an important role. It
may be possible [S98] to infer critical parameters of image-degradation
models from images of text in documents, without special test targets;
but even so, the inference will depend on character shapes specific to
the script.
We wish to attend the workshop in order to discuss these and related
issues with similarly interested researchers. We are open to the
possibility of contributing or bartering ground-truthed data sets.
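To make the sparsity problem concrete, here is a minimal sketch (not drawn from the text itself) of one of the simplest smoothing methods for a character bigram model: add-one (Laplace) smoothing, which assigns nonzero probability to bigrams never seen in training. The training string and alphabet are illustrative assumptions only.

```python
from collections import defaultdict

def train_bigram(text, alphabet):
    """Train an add-one (Laplace) smoothed character bigram model."""
    counts = defaultdict(int)          # counts[(a, b)]: times b follows a
    context_counts = defaultdict(int)  # context_counts[a]: times a is a context
    for a, b in zip(text, text[1:]):
        counts[(a, b)] += 1
        context_counts[a] += 1
    V = len(alphabet)

    def prob(a, b):
        # Add-one smoothing: every bigram count is incremented by one,
        # so unseen pairs still receive probability mass.
        return (counts[(a, b)] + 1) / (context_counts[a] + V)

    return prob

alphabet = "abcdefghijklmnopqrstuvwxyz "
prob = train_bigram("the theory of the thing", alphabet)

# A seen bigram outscores an unseen one, yet neither is zero:
assert prob("t", "h") > prob("t", "z") > 0
# The smoothed distribution still normalizes for each context:
assert abs(sum(prob("t", c) for c in alphabet) - 1.0) < 1e-9
```

Even this toy case shows the difficulty: with an alphabet of 27 symbols there are 729 possible bigrams, and a short training text attests only a handful of them; for larger scripts and higher-order models the fraction of observed n-grams shrinks dramatically, which is why the choice of smoothing method matters.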

[K93] G. Kopec, ``Least-squares Font Metric Estimation from Images,''
IEEE Trans. on Image Processing, vol. 2 no. 4, pp. 510-519,
Oct. 1993.

[S98] E. H. Barney Smith, ``Optical Scanner Characterization Methods
      using Bilevel Scans,'' Ph.D. dissertation, Dept. ECSE,
      Rensselaer Polytechnic Institute, Troy, NY, December 1998.