Multilingual Document Images and Corpora
----------------------------------------

A. Lawrence SPITZ

Document Recognition Technologies, Inc.
616 Ramona St, Suite 20, Palo Alto, CA 94301 USA
spitz@docrec.com Tel. +1-650-688-0842 FAX +1-650-688-0841

Images

We have over 700 page images in the following languages: Afrikaans,
Chinese, Danish, Dutch, English, French, German, Icelandic, Italian,
Japanese, Korean, Norwegian, Portuguese, Romanian, Spanish and Swedish.

Truthed images

We have about 600 "truthed" pages in the following languages:
Afrikaans, Danish, Dutch, English, French, German, Icelandic, Italian,
Norse, Norwegian, Portuguese, Romanian, Spanish and Swedish. However,
these should be considered only approximations to the truth, since the
data have been entered only once. We are seeking sponsorship for
completion of the truthing process.

Corpora

We have corpora ranging in size from 600 KB to 91 MB (averaging 16 MB) in
the following languages: Afrikaans, Danish, Dutch, English, French,
German, Icelandic, Italian, Norwegian, Portuguese, Romanian and Spanish.
These corpora have been cleaned by removing special characters and any
markup.
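
As an illustration only, a cleanup pass of this general kind might look
like the sketch below. It is not the procedure actually used on these
corpora. It assumes the markup is SGML/HTML-style tags and entities, and
it assumes "special characters" means control characters and symbols;
the function name clean_text and both patterns are hypothetical.

```python
import re
import unicodedata

# Assumption: markup takes the form of angle-bracket tags and &...; entities.
TAG_RE = re.compile(r"<[^>]+>")
ENTITY_RE = re.compile(r"&[A-Za-z0-9#]+;")
SPACES_RE = re.compile(r"[ \t]+")


def clean_text(raw: str) -> str:
    """Strip tags, entities, control characters and symbols from raw text."""
    text = TAG_RE.sub(" ", raw)
    text = ENTITY_RE.sub(" ", text)

    kept = []
    for ch in text:
        major = unicodedata.category(ch)[0]
        # Keep letters (L), digits (N), punctuation (P), spaces and newlines;
        # replace everything else (controls, symbols, tabs) with a space.
        if ch in "\n " or major in ("L", "N", "P"):
            kept.append(ch)
        else:
            kept.append(" ")
    text = "".join(kept)

    # Collapse runs of spaces and trim each line, preserving line breaks.
    lines = [SPACES_RE.sub(" ", line).strip() for line in text.split("\n")]
    return "\n".join(lines)
```

Accented letters such as those in French or Icelandic text are kept,
because they fall in the Unicode letter categories; only non-letter,
non-digit, non-punctuation characters are dropped.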

We would be willing to consider cooperative efforts where DocRec and
other entities combine data and analysis effort in return for shared
rights to use and distribute the data.

Software

We have a suite of software that runs on UNIX or Windows NT and is capable
of determining the script and language of a document. It currently
handles 23 languages in Roman script plus Chinese, Japanese and Korean. We
are seeking sponsorship for expanding this capability to a broader range
of scripts and languages.