Multi-lingual Features of the FineReader OCR Program
----------------------------------------------------

Vadim TERESCHENKO

ABBYY, P.O. #19, Moscow 105568 RUSSIA
Vadim_T@abbyy.com +7-095-234-4400 FAX +7-095-263-6278

FineReader is a very accurate OCR system which supports 5 alphabets: Latin,
Cyrillic, Greek, Hebrew, and Armenian. Another alphabet, Devanagari (Hindi),
is scheduled for release in Q1'2000. FineReader supports more than 55
languages and has morphological dictionaries for 21 languages. The ability
to read multi-lingual texts has been an explicit design goal since the very
beginning of the project (1993). FineReader can read text that contains a
mix of several languages even if they belong to different alphabets, use
different code pages, and have different writing directions (left-to-right
and right-to-left). Reading speed remains close to that for monolingual
text. For example, recognition of text containing a mix of German, Russian,
and Greek words is only 25% slower than that of plain English text.
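
To make the notion of mixed alphabets and mixed writing directions concrete,
here is a minimal sketch that tags each word of a string with its script and
direction. It relies only on Python's standard Unicode database and says
nothing about how FineReader works internally; the function names
(word_script, word_direction, tag_words) are hypothetical.

```python
import unicodedata

# The five alphabets named in the text. The name prefixes come from the
# Unicode character database, not from FineReader.
SCRIPTS = ("LATIN", "CYRILLIC", "GREEK", "HEBREW", "ARMENIAN")


def word_script(word):
    """Return the script of the first letter in `word`, or None if unknown."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            for script in SCRIPTS:
                if name.startswith(script):
                    return script
            return None
    return None


def word_direction(word):
    """Return 'rtl' if the word contains a right-to-left letter, else 'ltr'."""
    for ch in word:
        # 'R' and 'AL' are the Unicode bidirectional classes for
        # right-to-left letters (Hebrew and Arabic respectively).
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            return "rtl"
    return "ltr"


def tag_words(text):
    """Split on whitespace and return (word, script, direction) triples."""
    return [(w, word_script(w), word_direction(w)) for w in text.split()]
```

Calling tag_words on a line that mixes German, Russian, Greek, and Hebrew
words yields one triple per word, with only the Hebrew word marked as
right-to-left.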

In addition to the more than 55 supported languages, FineReader allows a
new language to be defined by explicitly specifying its alphabet,
punctuation marks, and an optional dictionary. If the new language contains
symbols for which FineReader was not trained, they can be trained manually.
Using this feature, our customers have successfully read languages such as
Tibetan, Old Russian, Gothic German, and Old Greek, as well as phonetic
transcriptions in dictionaries.
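
The three ingredients of a user-defined language (alphabet, punctuation
marks, optional dictionary) can be pictured as a small data structure. The
sketch below is an illustration only: the class LanguageDefinition and its
methods are hypothetical and do not represent FineReader's actual interface.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class LanguageDefinition:
    """Hypothetical container for a user-defined recognition language."""
    name: str
    alphabet: Set[str]
    punctuation: Set[str] = field(default_factory=set)
    dictionary: Optional[Set[str]] = None  # None means "no dictionary"

    def allows(self, ch):
        """True if `ch` is a letter, punctuation mark, or whitespace."""
        return ch in self.alphabet or ch in self.punctuation or ch.isspace()

    def unknown_symbols(self, text):
        """Symbols in `text` that the definition does not cover."""
        return {ch for ch in text if not self.allows(ch)}

    def accepts_word(self, word):
        """True if `word` uses only the alphabet and, when a dictionary
        is supplied, also appears in that dictionary."""
        if any(ch not in self.alphabet for ch in word):
            return False
        return self.dictionary is None or word in self.dictionary
```

In this picture, the result of unknown_symbols corresponds to the symbols
that would have to be added to the definition or trained manually.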

ABBYY ML/OCR activities include:
1) research and development of multi-lingual OCR (FineReader);
2) creation of testing databases for all languages supported by FineReader;
3) benchmarking of OCR systems on mono- and multi-lingual texts.
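
As an illustration of what such benchmarking might measure, the sketch
below computes character accuracy from the edit distance between a
ground-truth text and the recognized text. The choice of metric and the
function names are assumptions made here for illustration; the text does
not state which measures ABBYY uses.

```python
def edit_distance(reference, recognized):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions between the two strings."""
    previous = list(range(len(recognized) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        current = [i]
        for j, rec_ch in enumerate(recognized, start=1):
            cost = 0 if ref_ch == rec_ch else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference, recognized):
    """Fraction of reference characters recognized correctly, clamped at 0."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))
```

Because the comparison works on Unicode strings, the same measure applies
unchanged to mono- and multi-lingual test texts.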

Through participation in the workshop I hope to gain a better understanding
of current ML/OCR activities and projects. We are also interested in the
creation of an independent benchmarking body that would conduct annual
testing of all available mono- and multi-lingual OCR systems.