MULE: Multi-lingual Features of the Latest GNU Emacsen
------------------------------------------------------

Tomohiko MORIOKA, Mikiko NISHIKIMI, Naoto TAKAHASHI,
Ken'ichi HANDA, and Satoru TOMURA

Electro Technical Laboratory
1-1-4 Umezono, Tsukuba, Ibaraki, JAPAN
{tomo,nisikimi,ntakahas,handa,tomura}@etl.go.jp


1. Introduction

Text editors are fundamental tools for text manipulation. Multilingual text
thus requires multilingualization of text editors. This paper describes
Mule, a multilingual text processing system. Mule (Multilingual Enhancement
to GNU Emacsen) was first developed as an extension of GNU Emacs. Mule
handles multiple character sets and multiple language environments and
provides means to construct new ones. Although Mule started as an extension
of GNU Emacs, some of its facilities have been merged into the original GNU
Emacs since 1997 (ver. 20.1 or later), and XEmacs has its own, different
implementation of Mule. We will describe the current status of both
implementations.

The design policy of Mule is to keep it adjustable, extensible, and easy to
customize or augment facilities. In order to achieve this goal, we give
Mule a unified mechanism for multiple languages. Text processing is complex
work and different languages may require different handling, in many
aspects: inputting, storing, restoring, and displaying. When we say Mule
supports a certain language, it means that Mule can at least:

- read and write files encoded in any common format for that language,
- provide input methods for that language, and
- display text appropriately if a suitable font exists.

These facilities are controlled by the unified mechanism for handling
character sets, coding systems, input methods, and display routines. In
other words, defining these four points extends Mule to support a new
language.

In section 2, we will explain how Mule treats coded character sets and
coding systems, and we will list the scripts and encodings that Mule can
handle. Sections 3 and 4 briefly overview the language specific
features and input methods that Mule supports. Section 5 discusses the
character composition feature, required to properly display the Thai, Lao,
Devanagari and Tibetan scripts. Section 6 describes the support for UCS,
which MULE-UCS provides for Emacs 20.4 and XEmacs-UCS provides for XEmacs
21.2 (unreleased as of this writing). In section 7, we show our future
plans, including support for bi-directional writing systems, and section 8
concludes this paper.


2. Basic concept of Mule

Mule is designed to provide an extensible framework for multilingual
processing, which is realized by two extensible features: "Mule-charset"
and "coding-system". Mule supports many pre-defined Mule-charsets and
coding-systems, but users can freely define private ones.

2.1 Mule-charset

"Mule-charset" is a feature that defines a coded character set (CCS).
Current Mule implementations expect each Mule-charset to have the
structure of a graphic character set defined in ISO/IEC 2022. Mule
supports 4 types of charset: 94, 96, 94x94 and 96x96. If a CCS to be
used does not have the ISO/IEC 2022 structure, it must be first mapped
to fit into the framework. For example, VISCII (a CCS for Vietnamese)
is divided into two 96-character type Mule-charsets.

Although each CCS defined in the ISO/IEC 2022 register is identified by
its size-type and final-byte, Mule identifies each CCS by a unique
identification number called "charset-id". A Mule-charset is defined by
associating a unique charset-id with the corresponding CCS and by
informing Mule of three parameters of the CCS. The parameters are the
byte length in Mule's internal representation, the display width, and
the writing direction. It is also possible to define a character set
that is not registered in the ISO/IEC 2022 register. In that case, Mule
uses a final-character reserved for private use by ISO/IEC 2022.
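
As a rough illustration of the definition above, the following Python
sketch models a Mule-charset as a record holding the charset-id, the
ISO/IEC 2022 size-type, and the three parameters. It is not Mule's actual
data structure: the class name, the field names, and the id and byte
length values in the usage note are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MuleCharset:
    """Toy model of a Mule-charset definition (all names are assumptions)."""
    name: str
    charset_id: int      # unique id assigned by Mule (example values are made up)
    chars: int           # 94 or 96 code positions per byte
    dimension: int       # 1 for 94/96 sets, 2 for 94x94/96x96 sets
    internal_bytes: int  # byte length in the internal representation
    width: int           # display width in columns
    direction: str       # "l2r" or "r2l"

    def __post_init__(self):
        # Only the four ISO/IEC 2022 size-types named in the text are accepted.
        if self.chars not in (94, 96) or self.dimension not in (1, 2):
            raise ValueError("not an ISO/IEC 2022 style graphic set")

    @property
    def capacity(self) -> int:
        """Number of code positions the set can hold."""
        return self.chars ** self.dimension
```

For instance, MuleCharset("japanese-jisx0208", 146, 94, 2, 3, 2, "l2r")
describes a 94x94 set with a capacity of 8836 positions, while a 96
character set such as latin-iso8859-1 has a capacity of 96.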

Instead of using fixed length representation, we adopted multi-byte
variable length form (multibyte form) to represent characters in
Mule's buffer both for efficient memory usage and for extensibility.
With multibyte form, each character is represented by one or two bytes
for charset-id and one or two bytes for the character code in the
character set. The only exception is ASCII characters: they are
represented as is, and their charset-id is 0.
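
The idea of the multibyte form can be sketched in Python as follows. This
is a toy model, not Mule's real byte layout: the charset-id byte values
are invented, and setting the high bit of every code byte (so that it
cannot be mistaken for ASCII) is an assumption of this example.

```python
# Hypothetical registry: charset-id bytes -> number of code bytes that follow.
CHARSETS = {
    b"\x81": 1,      # one-byte id, one-byte code (made-up id)
    b"\x92": 2,      # one-byte id, two-byte code (made-up id)
    b"\x9a\xa0": 1,  # two-byte id, one-byte code (made-up id)
}
ASCII_ID = 0  # ASCII characters are stored as is; their charset-id is 0


def encode_char(charset_id: bytes, code: bytes) -> bytes:
    """Return the multibyte form: id bytes, then code bytes with bit 7 set."""
    if CHARSETS[charset_id] != len(code):
        raise ValueError("wrong code length for this charset")
    return charset_id + bytes(b | 0x80 for b in code)


def decode(buf: bytes) -> list:
    """Split a buffer into (charset_id, code) pairs."""
    out = []
    i = 0
    while i < len(buf):
        if buf[i] < 0x80:                      # plain ASCII byte
            out.append((ASCII_ID, buf[i:i + 1]))
            i += 1
            continue
        for id_len in (1, 2):                  # try one-byte, then two-byte ids
            cid = buf[i:i + id_len]
            if cid in CHARSETS:
                n = CHARSETS[cid]
                raw = buf[i + id_len:i + id_len + n]
                out.append((cid, bytes(b & 0x7F for b in raw)))
                i += id_len + n
                break
        else:
            raise ValueError("unknown charset-id at offset %d" % i)
    return out
```

In this sketch an ASCII character costs one byte, while the other
characters cost between two and four bytes depending on the lengths of
the charset-id and of the character code.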

The following table shows the pre-defined Mule-charsets. Users can,
however, define a private Mule-charset if no pre-defined one answers
their purposes.
===========================================================================
                          Emacs 20.4                XEmacs 21.2-Mule
---------------------------------------------------------------------------
Latin
  Basic [ASCII]           ascii                     ascii
  (for Right-to-Left)     ascii-right-to-left       ---------------
  [ISO 8859-1]            latin-iso8859-1           latin-iso8859-1
  [ISO 8859-2]            latin-iso8859-2           latin-iso8859-2
  [ISO 8859-3]            latin-iso8859-3           latin-iso8859-3
  [ISO 8859-4]            latin-iso8859-4           latin-iso8859-4
  [ISO 8859-9]            latin-iso8859-9           latin-iso8859-9
  [JIS X 0201-Latin]      latin-jisx0201            latin-jisx0201
  [VISCII]                vietnamese-viscii-lower   vietnamese-viscii-lower
                          vietnamese-viscii-upper   vietnamese-viscii-upper
International
  Phonetic Alphabet       ipa                       ipa
Chinese
  Phonetic Symbols        chinese-sisheng           sisheng
Chinese Traditional
  [BIG5]                  chinese-big5-1            chinese-big5-1
                          chinese-big5-2            chinese-big5-2
  [CNS 11643 plane 1]     chinese-cns11643-1        chinese-cns11643-1
  [CNS 11643 plane 2]     chinese-cns11643-2        chinese-cns11643-2
  [CNS 11643 plane 3]     chinese-cns11643-3        chinese-cns11643-3
  [CNS 11643 plane 4]     chinese-cns11643-4        chinese-cns11643-4
  [CNS 11643 plane 5]     chinese-cns11643-5        chinese-cns11643-5
  [CNS 11643 plane 6]     chinese-cns11643-6        chinese-cns11643-6
  [CNS 11643 plane 7]     chinese-cns11643-7        chinese-cns11643-7
Chinese Simplified
  [GB 2312]               chinese-gb2312            chinese-gb2312
  [CCITT Extended GB]     ------------------        chinese-isoir165
Japanese
  [JIS X 0208:1978]       japanese-jisx0208-1978    japanese-jisx0208-1978
  [JIS X 0208:1983]       japanese-jisx0208         japanese-jisx0208
  [JIS X 0212:1990]       japanese-jisx0212         japanese-jisx0212
  Katakana [JIS X 0201]   katakana-jisx0201         katakana-jisx0201
Korean [KS C 5601]        korean-ksc5601            korean-ksc5601
Cyrillic [ISO 8859-5]     cyrillic-iso8859-5        cyrillic-iso8859-5
Ethiopic                  ethiopic                  ethiopic
Greek [ISO 8859-7]        greek-iso8859-7           greek-iso8859-7
Thai
  [TIS 620]               thai-tis620               thai-tis620
  [XTIS]                  -----------               thai-xtis
Lao                       lao                       ---------------
Tibetan                   tibetan                   ---------------
                          tibetan-1-column          ---------------
Indian                    indian-1-column           ---------------
                          indian-2-column           ---------------
  [IS 13194]              indian-is13194            ---------------
Arabic                    arabic-1-column           arabic-1-column
                          arabic-2-column           arabic-2-column
                          arabic-digit              arabic-digit
  [ISO 8859-6]            arabic-iso8859-6          arabic-iso8859-6
Hebrew [ISO 8859-8]       hebrew-iso8859-8          hebrew-iso8859-8
(C1-Control)              ---------------           control-1
===========================================================================

2.2 coding-system

"coding-system" is a feature that defines a character encoding scheme
(CES). Fortunately, most of the existing encoding schemes fit in the
framework of ISO/IEC 2022, so we have categorized CESs into the ISO/IEC
2022 type and the non-ISO/IEC 2022 type.

For ISO/IEC 2022 CESs, Mule has a generic encoder/decoder. Although
ISO/IEC 2022 allows lots of variations to encode the same text, just a
few of them are actually used, and only six parameters are enough to
specify one CES. For instance, Chinese, Japanese, and Korean variants
of EUC (Extended UNIX Code) and all ISO 8859 series (Part number 1
through 10) differ only in two parameters.
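
To make the idea of one generic ISO/IEC 2022 encoder with a few
parameters more concrete, here is a small Python sketch that encodes a
single character of a 94x94 set in two ways: 8-bit (EUC style, with the
set invoked in GR) and 7-bit (ISO-2022-JP style, with an escape sequence
designating the set to G0). The function names and the choice of
parameters are assumptions of this example; they are not the six
parameters that Mule actually uses.

```python
ESC = b"\x1b"


def encode_8bit(row: int, cell: int) -> bytes:
    """EUC style: both bytes of a 94x94 character have the high bit set."""
    return bytes([0xA0 + row, 0xA0 + cell])


def encode_7bit(row: int, cell: int, designate: bytes = b"$B") -> bytes:
    """ISO-2022-JP style: designate the set to G0 (ESC $ B selects
    JIS X 0208), emit 7-bit bytes, then switch back to ASCII (ESC ( B)."""
    return ESC + designate + bytes([0x20 + row, 0x20 + cell]) + ESC + b"(B"


# The kanji "日" sits at row 38, cell 92 of JIS X 0208.
euc_bytes = encode_8bit(38, 92)
jis_bytes = encode_7bit(38, 92)
```

The same row and cell numbers yield both byte sequences; only the way the
character set is designated and invoked differs between the two schemes.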

For non-ISO/IEC 2022 CESs, current Mule has specific built-in
encoders/decoders for Shift_JIS and BIG5, and XEmacs 21.2-Mule also has
built-in schemes for UTF-8 and UCS-4. Mule also provides a simple
language called CCL (Code Conversion Language) and its interpreter.
CCL is powerful and suitable for writing code conversion algorithms,
which means that theoretically Mule can handle any kind of coding
system with an appropriate CCL program.

Whenever code conversion is required, Mule automatically invokes
either a built-in encoder/decoder or the CCL interpreter.

The following table shows examples of pre-defined coding systems. Users
can, however, define private coding systems if no pre-defined one answers
their purposes.
======================================================================
                          Coding System
----------------------------------------------------------------------
ISO/IEC 2022
  8bit                    iso-8859-1, iso-8859-2, iso-8859-3, iso-8859-4,
                          iso-8859-5, iso-8859-7, iso-8859-8, iso-8859-9,
                          cn-gb-2312 = gb2312 (*X), euc-jp, euc-kr
  7bit                    iso-2022-kr,
                          iso-2022-cn (*E), iso-2022-cn-ext (*E),
                          iso-2022-jp = junet, iso-2022-jp-2,
                          iso-2022-cjk (*E), iso-2022-int-1
  Generic                 iso-2022-7bit, iso-2022-7bit-ss2,
                          iso-2022-7bit-lock (*E), iso-2022-lock (*X),
                          iso-2022-7bit-lock-ss2 (*E),
                          iso-2022-8 (*X), iso-2022-8bit-ss2
  Character Composition   thai-tis620 (*E) = tis620 (*E), lao (*E),
                          tibetan (*E) = tibetan-iso-8bit (*E),
                          in-is13194-devanagari (*E) = devanagari (*E)
----------------------------------------------------------------------
Pseudo ISO/IEC 2022       compound-text (*E) = ctext, escape-quoted (*X)
----------------------------------------------------------------------
CCL                       koi8-r, alternativnyj, viscii, vscii,
                          tis-620 (*X)
----------------------------------------------------------------------
Shift_JIS                 japanese-shift-jis = shift_jis = sjis
----------------------------------------------------------------------
BIG5 (CN-BIG5)            chinese-big5 = big5 = cn-big5
----------------------------------------------------------------------
Elisp conversion          chinese-hz = hz = hz-gb-2312, viqr
----------------------------------------------------------------------
No conversion
  binary                  binary = no-conversion (*E)
  raw-text                raw-text = no-conversion (*X)
  Emacs internal          emacs-mule (*E)
----------------------------------------------------------------------
Auto-Detection            undecided = automatic-conversion (*X)
======================================================================
*E: only in Emacs 20; *X: only in XEmacs-MULE


3. Language specific facilities

Most localizations are based on a locale mechanism, which causes no
problem as long as each computer environment is limited to its own
locale. However, using multiple languages at one time is beyond the
scope of localization. Mule aims at multilingualization. Users can
use multiple languages anytime and anywhere, and no locale setting is
required.

People who have benefited from localization, however, might find it
inconvenient if there are no locale settings. To avoid this, Mule
supports a facility called language environment. Setting the language
environment does not restrict Mule to that language. It only sets some
default values, including the following.

o The default coding system that is used when the automatic
code-detection routine fails to narrow the candidates down to one.
For example, if the language environment is set to Chinese and the
code-detection finds a file to be coded in EUC (Extended UNIX Code),
the file is regarded as coded in the Chinese variant of EUC.

o The default coding system used for newly created files. If the
language environment is Chinese and a brand new file is created,
the file is also coded in the Chinese variant of EUC.

o The default input method. If the language environment is Chinese and
the Quail input method is invoked, one of the translation rule
packages for Chinese characters is activated first.

o The language of the tutorial file. The Mule tutorial file has been
translated into several languages, and if a translation into the
language defined in the language environment exists, Mule shows the
translated version.
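
The first two defaults in the list above can be pictured with the
following Python sketch. It only models the decision logic; the table
layout, the function names, and the input method and tutorial names are
assumptions of this example and do not correspond to Mule's actual
variables.

```python
# Hypothetical per-environment defaults (entries marked as such are made up).
LANGUAGE_ENVIRONMENTS = {
    "Chinese-GB": {
        "preferred_coding": "cn-gb-2312",
        "input_method": "chinese-example",    # hypothetical name
        "tutorial": "TUTORIAL.chinese",       # hypothetical name
    },
    "Japanese": {
        "preferred_coding": "euc-jp",
        "input_method": "japanese-example",   # hypothetical name
        "tutorial": "TUTORIAL.japanese",      # hypothetical name
    },
    "Korean": {
        "preferred_coding": "euc-kr",
        "input_method": "korean-example",     # hypothetical name
        "tutorial": "TUTORIAL.korean",        # hypothetical name
    },
}


def choose_coding_system(candidates, environment):
    """Pick a coding system when detection leaves several candidates."""
    if len(candidates) == 1:
        return next(iter(candidates))         # detection was unambiguous
    preferred = LANGUAGE_ENVIRONMENTS[environment]["preferred_coding"]
    if preferred in candidates:
        return preferred                      # the environment breaks the tie
    return None                               # still undecided


def coding_system_for_new_file(environment):
    """A brand new file gets the environment's preferred coding system."""
    return LANGUAGE_ENVIRONMENTS[environment]["preferred_coding"]
```

Note that the environment is consulted only to break ties: a file that
the detection routine identifies unambiguously keeps its own coding
system whatever the language environment is.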

The following table shows pre-defined language environments. New
language environments can be defined if users need one.
======================================================================
Emacs 20.4 XEmacs 21.2-MULE
----------------------------------------------------------------------
ASCII ASCII ASCII
Chinese
Chinese-GB Chinese-GB
Chinese-BIG5 Chinese-BIG5
Chinese-CNS ------------
Croatian ------------ Croatian
Cyrillic ------------ Cyrillic
Cyrillic-ALT Cyrillic-ALT
Cyrillic-ISO Cyrillic-ISO
Cyrillic-KOI8 Cyrillic-KOI8
Czech Czech -------------
Devanagari Devanagari -------------
English English English
Ethiopic Ethiopic Ethiopic
French -------- French
German German German
Greek Greek Greek
Hebrew Hebrew Hebrew
IPA IPA IPA
Japanese Japanese Japanese
Korean Korean Korean
Lao Lao ----------
Latin-1 Latin-1 Latin-1
Latin-2 Latin-2 Latin-2
Latin-3 Latin-3 Latin-3
Latin-4 Latin-4 Latin-4
Latin-5 Latin-5 Latin-5
Norwegian -------- Norwegian
Polish -------- Polish
Romanian Romanian Romanian
Slovak Slovak --------
Slovenian Slovenian --------
Thai Thai Thai-XTIS
Tibetan Tibetan ----------
Vietnamese Vietnamese Vietnamese
======================================================================


4. Input Method

Multilingual text editors have to handle many kinds of scripts, and
different scripts require different input methods. In the design of
Mule, input methods are categorized into the following four types.

(1) key mapping

This is the simplest kind of input method: it maps each key on a normal
English keyboard to another character. Typical examples are Greek and
Russian.

(2) key combination

Some input methods generate a composite character by combining a
sequence of keys. Typical examples are European languages and
Vietnamese, in which a character with a diacritical mark and a tone
mark is generated from a letter key and some symbol keys.

For instance, in the Vietnamese input method, a key sequence of a base
vowel key, <^>, and <'> generates one Vietnamese character. There also
exist several Chinese input methods of this type.

(3) mixture of key mapping and key combination

Input methods of this category first map keys and then generate a
composite character by combining the mapped keys. Typical
examples are Thai and Korean. For instance, in the Thai input method,
all keys are mapped to Thai consonants, vowels, or tone marks, and
a key sequence of a consonant character and the following vowel or
tone mark generates a composite character that puts the vowel and/or
tone mark above or beneath the consonant.

(4) mixture of key mapping, key combination and an external conversion
program

Methods of this kind are used for inputting ideographic
characters. Since there are more than ten thousand ideographic
characters, it is not realistic to remember all of the key
combinations. On the other hand, inputting phonetic characters is
rather easy. These input methods let users input phonetic
characters with the key mapping or key mapping and key combination
methods, and then hand the remaining task of generating
ideographic characters over to conversion programs.

For instance, in the Japanese input method, a Hiragana (Japanese
phonetic alphabet) sequence is typed in first, then the
sequence is converted by a conversion program into an appropriate
mixture of Kanji (ideographic characters) and Hiragana. Wnn,
Canna, SJ3 (these are all for Japanese), and cWnn (for Chinese)
are conversion programs that can be used from Mule. They usually
have very large dictionaries and knowledge about grammar.

Emacsen realize input methods using key mapping or key
combination as a keyboard input translation system named Quail.
Quail uses one set of translation rules (called a `Quail package')
at a time and translates user input accordingly. Users can add
new translation rules or modify existing translation rules in
order to customize a Quail package, or create a brand new package
for a new language they need.
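
The key mapping and key combination types can both be viewed as a table
of translation rules applied to the key sequence. The Python sketch below
illustrates this with a greedy longest-match translator. It is not how
Quail is implemented (Quail works interactively inside the editor), and
the two rule tables are small hypothetical examples, not actual Quail
packages.

```python
def translate(keys: str, rules: dict) -> str:
    """Translate a key sequence using the longest matching rule at each step."""
    longest = max(map(len, rules), default=0)
    out = []
    i = 0
    while i < len(keys):
        # Try the longest candidate first so "a^'" wins over "a^".
        for n in range(min(longest, len(keys) - i), 0, -1):
            chunk = keys[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            out.append(keys[i])   # no rule applies: pass the key through
            i += 1
    return "".join(out)


# Type (1), key mapping: one key maps to one character (hypothetical rules).
GREEK_RULES = {"a": "\u03b1", "b": "\u03b2", "g": "\u03b3"}

# Type (2), key combination: several keys yield one character
# (hypothetical rules).
VIETNAMESE_RULES = {"a^": "\u00e2", "a^'": "\u1ea5", "a'": "\u00e1"}
```

Adding a rule to a table, or replacing the table as a whole, changes the
behaviour without touching the translator, which is the property that
makes such packages easy to customize.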


5. Character Composition

In the CCS for scripts such as Thai, Lao, Devanagari and Tibetan, one
codepoint defines a glyph and multiple glyphs must be composed into
one character. In order to display these scripts properly, Mule
dynamically composes multiple glyphs into one character. This
facility is called "dynamic composition".

First let us consider the case of Thai and Lao. In these scripts, a
character can be displayed by vertically stacking up the glyphs
corresponding to codepoints. When Mule reads from or writes to files,
or receives input from users, a sequence of characters (usually a
Consonant+Vowel[+Tone] sequence) is put into one composed character.
Mule superimposes the glyphs of the constituent characters using their
metrics and displays a composed character.
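
The grouping step described above can be sketched in Python as follows.
The sketch works on Unicode strings and uses the Unicode general category
"Mn" (nonspacing mark) to recognize the vowels and tone marks that stack
on a consonant. Both choices are assumptions of this example; Mule itself
works on its own character sets (such as thai-tis620) and its own
composition rules.

```python
import unicodedata


def compose_clusters(text: str) -> list:
    """Group each base character with the nonspacing marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1] += ch        # stack the mark on the previous base
        else:
            clusters.append(ch)       # start a new composed character
    return clusters


# THO THAHAN + SARA II + MAI EK, followed by NO NU: two display cells.
cells = compose_clusters("\u0e17\u0e35\u0e48\u0e19")
```

Each resulting cluster corresponds to one composed character, that is,
one unit for display purposes.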

In the case of Devanagari and Tibetan, however, glyphs that correspond
to codepoints are not enough to compose an appropriate character. In
these scripts, the same vowel requires different glyphs in different
combinations. Glyphs of consonants may also change their shapes by
ligature. Mule has an internal character code for each such glyph
and one rule-base for each script. The rule-base defines a conversion
between a sequence of characters and a sequence of internal character
codes. Mule converts characters into internal character codes during
user input or file I/O. The rule-base also controls how the glyphs
corresponding to internal codes should be composed. Mule utilizes these
rules and the metrics of the constituent glyphs in order to display
properly composed characters.

"Dynamic composition" allows combinations of characters that do not
occur in the scripts, which is sometimes regarded as a problem.
Moreover, it is hard to design fonts that appear neat and pretty
when composed. Pre-composed characters can be a solution when a
script does not need too many combinations. Thai script, for instance,
only contains about 1000 combinations. "XTIS", proposed by Virach
Sornlertlamvanich, is an example of pre-composed Thai characters and
is implemented for Emacs 20 and XEmacs-MULE. Pre-composed Lao characters
are also now under consideration.


6. UCS support

Mule has a model of encoding schemes based on ISO/IEC 2022 and tools
that enable code conversion among various encoding schemes. It is not
easy, however, for users to develop a converter for a non-ISO/IEC 2022
scheme such as UTF-8 or UTF-16 of UCS (Universal Multiple-Octet Coded
Character Set; ISO/IEC 10646) or Unicode.

MULE-UCS on Emacs 20.4 and XEmacs-UCS on XEmacs 21.2 are the
mechanisms for conversions from/to UCS.

MULE-UCS is an Emacs Lisp library that takes (1) mapping tables
between Mule-charsets and UCS codepoints and (2) the order of priority
among the tables, which users may define arbitrarily, and generates
the CCL programs needed to carry out the conversion. Users can define a
new conversion policy just by changing the priorities of the mapping
tables, and they do not need to know the mapping tables in detail. On
the other hand, it is not easy to define a conversion character by
character.
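
The role of the priority order can be illustrated with the following
Python sketch, which merges several mapping tables into one lookup table
for the UCS to Mule-charset direction. It does not generate CCL and is
not the MULE-UCS API: the function name and the table layout are
assumptions of this example, and the tables contain only a few sample
entries keyed by their byte values in the corresponding 8-bit encodings.

```python
# Sample entries only: byte value in the 8-bit encoding -> UCS codepoint.
TABLES = {
    "latin-iso8859-1": {0xE9: 0x00E9, 0xFC: 0x00FC},
    "latin-iso8859-2": {0xE9: 0x00E9, 0xB3: 0x0142},
}


def build_encoder(tables: dict, priority: list) -> dict:
    """Merge the tables into one UCS -> (charset, code) lookup table.

    When a codepoint exists in several charsets, the charset that comes
    first in `priority` wins.
    """
    encoder = {}
    # Walk from lowest to highest priority so later entries overwrite.
    for name in reversed(priority):
        for code, ucs in tables[name].items():
            encoder[ucs] = (name, code)
    return encoder


latin1_first = build_encoder(TABLES, ["latin-iso8859-1", "latin-iso8859-2"])
latin2_first = build_encoder(TABLES, ["latin-iso8859-2", "latin-iso8859-1"])
```

With the first ordering U+00E9 is mapped into latin-iso8859-1, with the
second into latin-iso8859-2, although neither mapping table was edited.
Building the merged table up front mirrors, in a very simplified way, the
step in which the conversion program is generated ahead of time.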

Another merit of using MULE-UCS is that it can optimize the CCL programs
it generates. Thus, it can manage computational resources such as
memory efficiently even when users have multiple conversion policies.
However, as it takes much time for MULE-UCS to generate optimized CCL
programs, MULE-UCS is not very good at changing conversion policies
dynamically.

XEmacs-UCS manages the tables for decoding and encoding characters
with the help of four built-in functions that set or refer to the
mapping tables. XEmacs-UCS has also extended the scheme for defining
coding systems in order to include coding systems like UTF-8 and UCS-4
in Mule's framework. Users are required to define a conversion between
characters, which makes it possible to change settings dynamically and
flexibly. Information stored in the mapping tables can also easily be
used for other purposes, such as resolving character references in XML.

XEmacs-UCS manipulates the mapping tables themselves, and MULE-UCS
generates efficient conversion programs from given mapping tables.
Future UCS support should be able to do both of these steps of the
conversion procedure.


7. Future work

7.1 Bi-directional scripts

One of the important, and complicated, features that should be
supported in a truly universal multilingual editor is support for
bi-directional writing systems. Unfortunately, neither Emacs 20.4 nor
XEmacs 21.2-Mule supports this feature.

Some languages are written from right to left. Arabic and Hebrew are
the best known examples. It should be noted that even in these
languages, numerals are written from left to right. For example, `one
hundred and twenty three' is written with the digit `one' on the left
side and the `three' on the right side. Thus, support of
bi-directional writing is inevitably necessary for those languages,
even in a unilingual text editor.

Theoretically, there are two possible ways of storing bi-directional
text in memory. One is logical order, which stores characters in the
order humans read them. The other is visual order, which stores
characters as they are physically displayed on the screen. Visual
order is easier to implement, but there are good reasons to believe
that the internal representation should be in logical order. For
example, interprocess communication is supposed to be done in logical
order, and it would be very difficult if the internal representation
were in visual order.
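
The difference between the two orders can be shown with a deliberately
simplified Python sketch. It treats its whole input as one right-to-left
run stored in logical order and returns the left-to-right screen order,
keeping runs of ASCII digits in their left-to-right reading order. The
function name is hypothetical, and this is neither the algorithm used in
Mule-2.3 nor a general bi-directional algorithm; nesting and mixed
left-to-right text are not handled.

```python
import re


def rtl_logical_to_visual(logical: str) -> str:
    """Screen order (left to right) of a pure right-to-left run."""
    mirrored = logical[::-1]                  # the first character goes rightmost
    # Digits are read left to right even inside right-to-left text,
    # so each digit run is flipped back.
    return re.sub(r"[0-9]+", lambda m: m.group()[::-1], mirrored)


# Logical order: ALEF, BET, space, then the number 123.
visual = rtl_logical_to_visual("\u05d0\u05d1 123")
```

Here the number appears to the left of the Hebrew letters on the screen
with its digits still in the order 1, 2, 3, while the buffer keeps the
order in which the text is read.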

When bi-directional writing is implemented based on logical order,
even the most basic editing operations, namely cursor motions, can be
confusing for users. Hitting the right arrow key repeatedly may not
result in a continuous cursor movement; the cursor jumps when it
reaches a boundary between a left-to-right string and a right-to-left
string, then starts moving in the opposite direction. Insertion and
deletion are also confusing, especially when these actions take place
at boundaries where two strings of different directions meet.

Mule-2.3, which is an ancestor of Emacs 20.4, supported bi-directional
writing with the internal representation being in logical order. To
avoid users' confusion, a special mode, in which users can edit the
text as if it were stored in visual order, was provided.

We are planning to re-implement the bi-directional feature of Mule-2.3
in future versions of Emacs and XEmacs with some enhancements. One of
the planned enhancements will be support for nested writing
directions. Since Mule-2.3 recognized the writing direction only by a
property assigned to each character set, it was not able to handle
direction nesting properly. People may argue that `an English text
(left to right) embedded in an Arabic text (right to left) embedded in
an English text (left to right)' is rare. However, as we mentioned
earlier, Arabic (and Hebrew) numerals are written from left to right.
Therefore it is highly possible that an Arabic text (right to left)
including Arabic numerals (left to right) appears within an English
text. We thus believe that direction nesting, with at least two
levels of nesting supported, should be handled properly.

7.2 Problems with current character composition

In the current implementation of dynamic character composition, a Mule
buffer may contain internal character codes used only for the purpose
of text rendering, which causes difficulties with text processing
operations such as searching or sorting. We plan to change the
implementation so that the information used only for display is
stored somewhere other than in buffers, and buffers are reserved only
for sequences of characters.


8. Conclusion

We have introduced Mule, a multilingual editor. Mule has a unified
mechanism for multiple languages that makes it adjustable, extensible,
and easy to customize.

Since Mule was first released in 1993, people around the world have
contributed support for their own languages. This fact shows how
widely Mule is used and how easily it can be extended. Aside from new
language support, there also exist many contributed applications
running on Mule, such as on-line dictionary lookup tools and MIME
encoders and decoders. The existence of these tools shows that Mule
can be a multilingual workbench/environment, rather than a mere
editor.


Appendix: Mule Distribution

GNU Emacs and XEmacs are distributed under the terms of the GNU General
Public License, and are available through anonymous ftp from the
following sites.

o Emacs 20.3 (including Mule features)
ftp://ftp.gnu.org/pub/gnu/

o XEmacs
ftp://ftp.xemacs.org/pub/xemacs/
http://www.xemacs.org/