Classification and Identification of Multi-lingual Documents
------------------------------------------------------------

Jie Ding, Louisa Lam, and Ching Suen

Centre for Pattern Recognition and Machine Intelligence, Suite GM-606
Concordia University, 1455 de Maisonneuve Blvd West
Montreal, Quebec H3G 1M8, CANADA
llam@ied.edu.hk Tel. +852-2948-7808 FAX +852-2948-8014


Language classification (LC) refers to the categorization of text
documents into different natural language groups, whereas language
identification (LI) determines the language used in a document. LC
and LI play important roles in document processing systems, because
they can perform initial classifications to reduce the scope for
subsequent stages of processing [1, 2]. This study addresses these
two topics as follows:

(1) LC of documents written in 24 languages into two language
categories (oriental and European), and
(2) LI of oriental documents into Chinese, Japanese and Korean.

Statistical features have been explored to differentiate between
documents printed in various natural languages. A total of 6
distinctive features are proposed, of which 3 are used for LC,
viz. horizontal projection profiles, height distributions of
connected components (CC) and enclosing structure of connected
components. Experimental results show that the script of a document
can be classified as either European or oriental on the basis of four
sets of 50 connected components (50-CCs), with a high recognition rate
and a low rejection rate.
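
To illustrate the first of these features, a horizontal projection
profile counts the foreground pixels in each row of a binarized text
line. The following is a minimal sketch only; the function name and the
list-of-rows image representation are assumptions made for the example
and are not taken from the system described here.

```python
def horizontal_projection(image):
    """Return the number of foreground (1) pixels in each row.

    `image` is assumed to be a binarized text line given as a list of
    rows, each row a list of 0 (background) or 1 (ink) values.
    """
    return [sum(row) for row in image]


# A toy 5 x 6 binary image standing in for a scanned text line.
line = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
profile = horizontal_projection(line)
print(profile)  # [0, 3, 6, 2, 0]
```

The shape of such a profile (for example, how the ink is distributed
over the height of the line) is the kind of statistic that can then be
compared between scripts.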

In the LI of oriental documents, the complexity of structure, Korean
'circles' and vertical strokes have been chosen as features to
distinguish among the three language scripts. The identification
has been made according to the values of these features, and also by
a clustering algorithm.

For experimental studies, seven hundred documents were collected at
CENPARMI. The recognition rates achieved in LC and LI have exceeded
95% and 94%, with error rates below 2% and 4.5%, respectively.


Results of language classification by using one 50-component:

Language   Samples   Not processed   Recognition(%)   Error(%)   Reject(%)
===========================================================================
European     262           0              95.32          4.68       0.00
Chinese      181           0              98.34          1.66       0.00
Japanese      84           0              99.21          0.79       0.00
Korean       154           0              97.62          2.38       0.00



Results of language classification by using three 50-components:

Language   Samples   Not processed   Recognition(%)   Error(%)   Reject(%)
===========================================================================
European     262           2              98.08          1.92       0.00
Chinese      181           2              100.0          0.00       0.00
Japanese      84           0              100.0          0.00       0.00
Korean       154           5              100.0          0.00       0.00



Results of language classification by using four 50-components:

Language   Samples   Not processed   Recognition(%)   Error(%)   Reject(%)
===========================================================================
European     262           5              99.22          0.00       0.78
Chinese      181           4              100.0          0.00       0.00
Japanese      84           0              100.0          0.00       0.00
Korean       154          10              100.0          0.00       0.00


The above tables give the results of LC based on one, three and four
sets of 50 components, respectively. In the last two cases, decisions
are made by majority voting over the 3 or 4 sets of 50 components [3].
To generate a set, we randomly select a relatively long text line and,
if necessary, concatenate several lines to obtain 50 components. As the
generation of a set of 50 components depends on random selection, we do
not rely on the outcome of a single trial: when only one set of 50
components is used, we test the data set three times and average the
results in order to reduce the element of chance. The results show that
the error rates are higher when an odd number (1 or 3) of sets of 50
components is considered, while the rejection rates are higher when an
even number of sets is used.
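
The effect of an even number of sets can be seen in a small sketch of
majority voting with rejection. It assumes that a label is accepted
only when it receives more than half of the votes and that the document
is rejected otherwise; this rule and the function name are assumptions
for illustration, since the exact voting rule of [3] is not restated
here.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label holding a strict majority, else 'reject'.

    `votes` holds one decision per set of 50 components, for example
    ["European", "Oriental", "European"].
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    # Assumed rule: accept only if more than half of the votes agree.
    if 2 * top > len(votes):
        return label
    return "reject"


print(majority_vote(["European", "Oriental", "European"]))
# European
print(majority_vote(["European", "Oriental", "European", "Oriental"]))
# reject
```

With two categories, three votes always produce a strict majority, so
nothing is rejected, whereas four votes can split 2-2. This is
consistent with rejections appearing only in the table for four sets.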

Analyses of the results indicate that European documents tend to be
misclassified as Oriental ones if:

(1) the quality of the document is poor, because many characters are
broken or some characters touch each other; or

(2) the document is written in certain fonts that are closer to
handwriting than to machine print.

On the other hand, Oriental documents tend to be classified as European
if more than 20% of the characters in the document belong to a foreign
language.

In order to take care of those documents not processed in LC, the scripts
are further processed by an LI stage, which makes use of 3 principal
features, viz.
(i) complexity of the character 'C',
(ii) circles/ellipses 'K', and
(iii) vertical strokes 'V'.

Examination of the training data indicates that Korean documents have
"high" K and V values, while Chinese and Japanese documents have a
different range of C values. Intuitively, language identification
can be based on the C, K and V values.
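
A rule of this kind might be sketched as below. This is a hypothetical
illustration only: the numeric thresholds are not given in this text,
so they are left as parameters supplied by the caller, and the
assumption that Chinese documents have the higher C values is inferred
from the error analysis rather than stated directly. Rejection is not
modelled.

```python
def identify_oriental(c, k, v, k_min, v_min, c_split):
    """Toy decision rule on the C, K and V features.

    k_min, v_min and c_split are hypothetical thresholds that would
    have to be estimated from training data; no values are given in
    the source.
    """
    # Korean documents are described as having "high" K and V values.
    if k >= k_min and v >= v_min:
        return "Korean"
    # Assumption: Chinese characters are structurally more complex,
    # so the higher range of C is taken to indicate Chinese.
    return "Chinese" if c >= c_split else "Japanese"


# Usage with arbitrary, made-up feature values and thresholds.
print(identify_oriental(0.3, 0.9, 0.8, k_min=0.5, v_min=0.5, c_split=0.6))
# Korean
```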



Results of oriental language classification by using C, K and V values:

Language   Samples   Not processed   Recognition(%)   Error(%)   Reject(%)
---------------------------------------------------------------------------
Chinese      114           1              94.69          4.43       0.88
Japanese      49           0              95.92          0.00       4.08
Korean       106           1              93.33          6.67       0.00



Confusion matrix when using C, K and V values:

            Chinese   Japanese   Korean   Reject
--------------------------------------------------
Chinese       107         5         0        1
Japanese        0        47         0        2
Korean          0         7        98        0
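
The percentages in the results table follow directly from the rows of
this confusion matrix, as the short sketch below shows. The function
name and the dictionary layout are choices made for the example; the
counts are copied from the matrix above.

```python
def rates(row, true_label):
    """Return (recognition, error, reject) percentages for one row.

    `row` maps each output label, including "Reject", to the number of
    processed documents that received it.
    """
    total = sum(row.values())
    correct = row[true_label]
    rejected = row.get("Reject", 0)
    errors = total - correct - rejected
    return tuple(round(100.0 * n / total, 2)
                 for n in (correct, errors, rejected))


confusion = {
    "Chinese":  {"Chinese": 107, "Japanese": 5,  "Korean": 0,  "Reject": 1},
    "Japanese": {"Chinese": 0,   "Japanese": 47, "Korean": 0,  "Reject": 2},
    "Korean":   {"Chinese": 0,   "Japanese": 7,  "Korean": 98, "Reject": 0},
}

for language, row in confusion.items():
    print(language, rates(row, language))
# Chinese (94.69, 4.42, 0.88)
# Japanese (95.92, 0.0, 4.08)
# Korean (93.33, 6.67, 0.0)
```

The Chinese error rate comes out as 4.42 rather than the tabulated
4.43; the table value may have been adjusted so that the row sums to
exactly 100%.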


In an effort to improve the results, the K-means clustering algorithm is
also adopted. Based on the size of the training samples in our database and
on some preliminary results, 4 clusters are generated for each of the
three languages and hence 12 clusters have been used to represent the
training data. For a given testing document, a full search through the
12 clusters is made in order to find the best match.
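
The full search can be sketched as a nearest-centroid lookup. The
sketch assumes Euclidean distance in the (C, K, V) feature space and
uses made-up centroids, with only two per language for brevity; neither
the distance measure nor any centroid values are given in this text,
and rejection is not modelled.

```python
import math


def nearest_cluster(features, clusters):
    """Return the language whose cluster centroid is closest.

    `clusters` is a list of (language, centroid) pairs, where each
    centroid is a (C, K, V) tuple obtained beforehand by K-means.
    Euclidean distance is an assumption of this sketch.
    """
    language, _centroid = min(
        clusters, key=lambda pair: math.dist(features, pair[1])
    )
    return language


# Made-up centroids for illustration only (two per language here;
# the study uses four per language, 12 in total).
clusters = [
    ("Chinese",  (0.80, 0.10, 0.20)),
    ("Chinese",  (0.70, 0.15, 0.30)),
    ("Japanese", (0.40, 0.10, 0.20)),
    ("Japanese", (0.35, 0.20, 0.25)),
    ("Korean",   (0.50, 0.80, 0.70)),
    ("Korean",   (0.45, 0.70, 0.80)),
]

print(nearest_cluster((0.48, 0.75, 0.75), clusters))  # Korean
```

Using several centroids per language lets one language occupy more than
one region of the feature space, for instance when different fonts
yield different feature values.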


Results from clustering using C, K and V features:

Language   Samples   Not processed   Recognition(%)   Error(%)   Reject(%)
---------------------------------------------------------------------------
Chinese      114           1              94.69          4.43       0.88
Japanese      49           0              97.96          0.00       2.04
Korean       106           1              97.14          1.91       0.95


Confusion matrix from clustering using C, K and V features:

            Chinese   Japanese   Korean   Reject
--------------------------------------------------
Chinese       107         5         0        1
Japanese        0        48         0        1
Korean          0         2       102        1


An analysis of the results indicates that Chinese documents tend to be
recognized as Japanese when they are written in Kai font, in which strokes
are smooth and do not touch each other, so that fewer complex structures
are found. Some Korean documents are misclassified as Japanese when
"ellipses" are used to represent "circles", because these "ellipses" look
more like rectangles than circles.


In summary, our method of LC works well in the processing of documents
containing a mixture of both language groups (which is quite common in
technical documents), provided that the non-host language content does
not exceed about 20% of the whole document. It has also been developed
to handle documents that might be written in any of 24 different
languages. It works well on Cyrillic documents, which do not possess the
same characteristics as documents in Roman languages. However, our
Korean circle detection method cannot perfectly separate Korean circles
from ellipses, and hence the recognition rate decreases when ellipses
are used in certain Korean fonts. Also, for the Chinese Kai font, the
complex structure cannot be detected easily, as the strokes of this font
are generally smooth and do not touch each other.


Acknowledgements

This research was supported by grants from the FCAR Program of the
Ministry of Education of Quebec and the Natural Sciences and Engineering
Research Council of Canada.


References

1. J. Ding, L. Lam, and C. Y. Suen, "Classification of Oriental and
European scripts by using characteristic features," Proc. ICDAR'97,
pp. 1023-1027.

2. D. S. Lee, C. R. Nohl, and H. S. Baird, "Language identification in
complex, unoriented, and degraded document images," Proc. IAPR Workshop
on Document Analysis Systems, Malvern, Pennsylvania, Oct. 1996, pp.
76-98.

3. L. Lam and C. Y. Suen, "Application of majority voting to pattern
recognition - an analysis of its behavior and performance," IEEE Trans.
Syst., Man, and Cybern., vol. 27, pp. 553-568, Sept. 1997.