"To us
all towns are one, all men our kin. |
Home | Whats New | Trans State Nation | One World | Unfolding Consciousness | Comments | Search |
Unicode and Tamil - Issues with Implementation
Muthelilan Murasu Nedumaran
Tamilnet'97
Abstract
With the advent of the Unicode initiative, Tamil has found a place among the other languages of the world for uniform electronic representation. However, the Unicode character encoding for Tamil poses a significant number of issues for both font and software developers, who have been working with their own character encodings for their specific needs. The Unicode Standard 2.0 makes a clear distinction between characters and glyphs. This is a big departure from the one-to-one character to glyph mapping that Tamil font developers have been adopting all this while. Not every letter of the Tamil alphabet is assigned a character code in Unicode, nor does Unicode define all the glyphs required to compose every Tamil letter. This means that software developers need to work on specific text-processing algorithms to implement Tamil script in their applications. This paper attempts to provide some insight into the difference between characters, glyphs and fonts, and looks at the impact that this model will have on two key areas: i) the use of Tamil in readily available shrink-wrapped software, and ii) the development of new applications in Tamil for commercial use and consumer devices.
Introduction
For a decade or so, there have been numerous independent efforts all over the world to bring Tamil into the electronic media. The lack of a universally accepted standard for Tamil text processing has led to the creation of a wide variety of formats and character encodings. However, most of these efforts were centered on fonts and their use with common word processing and desktop publishing applications. The fonts so developed had a sufficient number of glyphs to render Tamil text electronically. In some instances, developers would write keyboard drivers to map Tamil characters onto the standard computer keyboard. The applications used, however, have no knowledge of the fact that the language used is Tamil. This had many advantages, the main one being the ability to use commercially available shrink-wrapped software "as is" for word processing, desktop publishing and even, to some extent, data processing.
The simplest solution to the problem of differing character sets and input methods would be to define a single encoding for Tamil and, though this is less critical, to standardise on an input method. There are efforts around the globe to do this, and some of them are reaching the stage of bearing fruit. The [email protected] mailing list is a clear example of such an initiative.
This approach is important and is required for most of our immediate needs with regard to Tamil computing. However, the Internet momentum, coupled with the global shift towards internationalised software development, is pushing the need for a globally accepted standard for character encoding that encompasses all the languages in the world - past, present and future. Unicode, which is being accepted by most IT organisations, vendors and users alike, makes this possible. Unicode includes all the characters from all major international standards approved and published before December 31, 1990. Tamil made it into Unicode through ISCII (Indian Standard Code for Information Interchange).
Major system software vendors are implementing Unicode today in their native environments. Examples are Windows NT and Java. A data type defined as a character in Java defaults to a Unicode character. This makes software development with Unicode a lot easier, as developers need not worry about the risk of dissecting a character into two meaningless bytes.
Unicode makes a clear distinction between characters and glyphs. It encodes only characters and leaves it to the text processing application to map the appropriate glyphs, both for the defined characters and for characters formed by combining two or more of the defined characters. Before proceeding further, it is useful to have a good understanding of the difference between characters, glyphs and fonts.
Character, Glyph and Font
Characters are represented by character codes. When a user creates a document and inputs text, the text is represented and stored as characters. Characters are not visible. In other words, a user does not view or print characters. For a character to be viewed or printed (on screen or paper), it must be represented by one or more glyphs. Traditional character codes use a one-to-one mapping of characters and glyphs. In other words every character that is in the character set is represented by a glyph and these are sufficient to render the entire script of that language.
In Unicode, although every character is represented by a glyph, the set of glyphs needed to render a script can be larger than the number of characters defined, which is the case for Tamil. This is why we need special algorithms to map sequences of characters into glyphs. For example, a consonant character followed by a dependent vowel sign character results in a single composite glyph. Both the consonant and the vowel sign are defined as characters, and together they are sufficient to represent the composite form. As such, the composite form need not be defined as a character.
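As a rough illustration (not taken from the paper), the sketch below uses Python's standard `unicodedata` module to show one Tamil syllable stored as two coded characters. The choice of KA followed by VOWEL SIGN I is an assumption made purely for the example, and `describe` is a hypothetical helper name.

```python
import unicodedata

# Example syllable chosen for illustration: consonant KA + dependent vowel sign I.
syllable = "\u0B95\u0BBF"

def describe(text):
    """Return (code point, Unicode name) pairs for each stored character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code, name in describe(syllable):
    print(code, name)

# Two characters are stored, although a reader sees one written unit.
print("stored characters:", len(syllable))
```

The composite form has no code point of its own; producing its shape is left to the rendering software and the font.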
A font is a collection of glyphs. Modern font technology, such as TrueType Open, also includes the mapping tables along with the glyphs and can contain glyphs for more than one language in a single font.
Characters and glyphs are closely related, with many attributes in common. However, the distinctions between them make it essential that they be managed as separate entities. ISO/IEC 15285 (working draft) characterises characters and glyphs, and the relationship between them, as follows:
A character conveys distinctions in meaning or sounds. A character has no intrinsic appearance.
A glyph conveys distinctions in form or appearance. A glyph has no intrinsic meaning.
One or more characters may be depicted by no, one, or multiple glyph representations (instances of an abstract glyph) in a way that may depend on the context.
The relationship between coded characters and glyph identifiers may be one-to-one, one-to-many, many-to-one, or many-to-many.
Character Sets
A character set is a collection of characters. Characters from different language systems can be grouped together to form different character sets. This was done primarily because, in the past, a character set could contain only a limited number of characters. Character sets are either single byte or double byte. The sections below describe both.
Single byte character sets
A single byte character set employs either a 7-bit or an 8-bit encoding. A character encoding that uses 7 bits encodes only 128 characters. This is the most universal and is what ASCII (American Standard Code for Information Interchange) uses. The 7-bit code space encodes all the printable characters that we see on a typical US-English computer keyboard. They include punctuation marks, numbers, lower and upper case letters and mathematical symbols. As these characters, plus the control characters, take up the entire 7-bit code space, 8-bit encoding was introduced to include characters other than those we see on the keyboard.
With 8 bits representing a character, we can have a total of 256 characters in a set. The first 128 of the 256 are assigned to ASCII to maintain compatibility; the remaining 128 (commonly called the extended set) is where the action happens for non-English languages. This space is not even sufficient to represent all of the languages required by the European Union at once. As such, many character sets were developed and standardised, each of them having the same characters in the first 128 positions and different ones in the extended space. With the exception of CJK (Chinese, Japanese and Korean), the rest of the world uses an 8-bit encoding scheme.
Almost all Tamil encoding efforts by individual font developers used the extended set. An exception is Mylai which used 7-bits and replaced English alphabets.
The limitation this imposes is that it is not possible to have, for example, English, French and Tamil in the same document when it is stored as plain text (i.e. without font and other typographical information). This is because both French and Tamil share the same code space (assuming Tamil characters are encoded in the extended space). It is not possible to tell whether a given 8-bit character is meant as French or as Tamil.
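The ambiguity can be sketched in a few lines of Python. The Tamil table below is entirely hypothetical (the slot 0xE9 and the helper `decode_hypothetical_tamil` are assumptions for illustration; each real Tamil font encoding used its own layout), while `latin-1` is Python's standard codec for ISO 8859-1.

```python
# Hypothetical extended-set table for an imaginary 8-bit Tamil font encoding.
# Pretend that byte 0xE9 means TAMIL LETTER KA (an assumption, not a real layout).
HYPOTHETICAL_TAMIL_8BIT = {0xE9: "\u0B95"}

def decode_hypothetical_tamil(data: bytes) -> str:
    """Decode bytes: ASCII below 0x80, the made-up Tamil table above it."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            # U+FFFD marks bytes the made-up table does not cover.
            out.append(HYPOTHETICAL_TAMIL_8BIT.get(b, "\uFFFD"))
    return "".join(out)

plain_text = b"caf\xe9"  # four bytes, with no font or language information

as_french = plain_text.decode("latin-1")         # ISO 8859-1 reading
as_tamil = decode_hypothetical_tamil(plain_text)  # made-up Tamil reading

print(ascii(as_french))
print(ascii(as_tamil))
```

Both readings are equally valid for the same four bytes; nothing in the plain text says which one was intended.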
Plain text is a necessity because it is about the only form of text representation that is universally accepted for information interchange. Besides, plain text is platform independent, i.e. it can be stored, viewed and manipulated in any word processor, computer system or electronic device.
Double Byte Character Sets (DBCS)
This encoding uses both 8-bit and 16-bit codes and is usually referred to as multi-byte encoding. It is used mostly in CJK environments and uses lead bytes and trail bytes to map characters outside the 256 space. Most of the commonly used encoding schemes from SBCS and DBCS have been adopted and integrated into Unicode.
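A small sketch of the lead byte and trail byte idea follows. Shift_JIS is used only as a convenient example of a multi-byte encoding available in Python's standard codecs; the paper does not name a specific one, and `split_is_safe` is a hypothetical helper.

```python
text = "A\u65e5"  # Latin 'A' followed by the CJK ideograph U+65E5

# 'A' takes one byte; the ideograph takes a lead byte plus a trail byte.
encoded = text.encode("shift_jis")
print(len(text), "characters ->", len(encoded), "bytes")

def split_is_safe(data: bytes, cut: int) -> bool:
    """Return True if both halves decode cleanly when data is cut at `cut`."""
    try:
        data[:cut].decode("shift_jis")
        data[cut:].decode("shift_jis")
        return True
    except UnicodeDecodeError:
        return False

print(split_is_safe(encoded, 1))  # cut between the two characters
print(split_is_safe(encoded, 2))  # cut between lead byte and trail byte
```

Cutting between a lead byte and its trail byte leaves bytes that no longer decode, which is the "dissected character" risk that software handling such encodings has to guard against.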
Unicode
Unicode is a 16-bit character set that encompasses many characters used in general text interchange throughout the world. It contains 65,536 possible code points of which a third is still unassigned. Unlike other character encoding standards that assign character codes to both characters and glyphs, Unicode assigns character codes only to characters. Unicode is not a technology in itself. It allows for co-existence of many languages but it does not happen automatically. Tamil is a classic example of this. The number of characters defined in the Unicode table for Tamil is not enough to render Tamil script. As such, in a font that is capable of rendering Tamil, the set of glyphs is greater than the number of Tamil characters defined in the Unicode table. Algorithms that specifically understand the mapping of defined characters to their associated glyphs need to be present in the system for it to fully render Tamil script with Unicode.
Design goals of Unicode
The original design goals of Unicode as defined in the Unicode Standard version 2.0 are as follows :
a. Universal - The repertoire must be large enough to encompass all characters that were likely to be used in general text interchange, including those in major international, national, and industry character sets.
b. Efficient - Plain text, composed of a sequence of fixed width characters, provides an extremely useful model because it is simple to parse: software does not have to maintain state, look for special escape sequences, or search forward or backward through text to identify characters.
c. Uniform - fixed character code allows efficient sorting, searching, display, and editing of text.
d. Unambiguous - Any given 16-bit value always represents the same character.
Implementing Tamil with Unicode
The Tamil character block in Unicode occupies the range U+0B80 to U+0BFF (in decimal, 2944 to 3071; 128 locations). The table below shows just the independent vowels. The complete table of defined characters can be found in the Unicode Standard, Version 2.0.
Code Space | Character | Name
0B85 | அ | TAMIL LETTER A
0B86 | ஆ | TAMIL LETTER AA
0B87 | இ | TAMIL LETTER I
0B88 | ஈ | TAMIL LETTER II
0B89 | உ | TAMIL LETTER U
0B8A | ஊ | TAMIL LETTER UU
0B8E | எ | TAMIL LETTER E
0B8F | ஏ | TAMIL LETTER EE
0B90 | ஐ | TAMIL LETTER AI
0B92 | ஒ | TAMIL LETTER O
0B93 | ஓ | TAMIL LETTER OO
0B94 | ஔ | TAMIL LETTER AU
An English name is defined in Unicode for each character, usually prefixed with 'TAMIL LETTER' or 'TAMIL VOWEL SIGN'. For example, TAMIL LETTER KA refers to க and TAMIL VOWEL SIGN E refers to ெ.
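The block range and the naming convention can be checked with Python's standard `unicodedata` module. This is a minimal sketch: Python ships a later version of the Unicode database than 2.0, so the set of assigned locations may be larger than the one the paper describes, and `named_in_block` is a hypothetical helper name.

```python
import unicodedata

TAMIL_BLOCK_START = 0x0B80
TAMIL_BLOCK_END = 0x0BFF

def named_in_block(prefix):
    """Collect (code point, name) for assigned characters whose name starts with prefix."""
    found = []
    for cp in range(TAMIL_BLOCK_START, TAMIL_BLOCK_END + 1):
        name = unicodedata.name(chr(cp), None)  # None for unassigned locations
        if name and name.startswith(prefix):
            found.append((f"{cp:04X}", name))
    return found

# Decimal bounds and size of the block.
print(TAMIL_BLOCK_START, TAMIL_BLOCK_END, TAMIL_BLOCK_END - TAMIL_BLOCK_START + 1)

# List the dependent vowel signs by their standard names.
for code, name in named_in_block("TAMIL VOWEL SIGN"):
    print(code, name)
```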
The characters defined in the table are insufficient to render Tamil script completely. Only the independent vowels, consonants, dependent vowel signs (modifiers) and Tamil numerals are defined. To be able to render Tamil script, the text processing system has to map character sequences to their appropriate glyphs. The Unicode Standard version 2.0 deals with all the various combinations for Tamil, so only a few are mentioned here:
Memory Representation (Defined Characters) | Display (Glyph Table)
[Table of example character sequences and the glyphs used to display them; the Tamil examples are not recoverable.]
Glyphs to render composite character sequences can be stored in a glyph
table. From these examples, we can see that there is a considerable amount of
interaction between character tables and glyph tables. The composition and
layout process spans across both the tables. The presentation of glyphs based on
the character sequence requires three primary operations :
a. selecting the glyph representations needed to display the character sequence.
b. assigning positions to the glyph shapes.
c. imaging the glyph shapes on screen or printer.
Glyph selection is the process of selecting (possibly through several iterations) the most appropriate glyph identifier or combination of glyph identifiers to render a coded character or a composite sequence of coded characters. Deleting text may require the reverse process.
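The glyph selection step (operation a above) might be sketched as a longest-match lookup in a glyph table. Everything named here is an assumption for illustration: the glyph identifiers, the tiny `GLYPH_TABLE` and the `select_glyphs` function are hypothetical, and a real font would define its own table. Positioning and imaging (operations b and c) are not covered.

```python
KA, VOWEL_SIGN_I = "\u0B95", "\u0BBF"
VOWEL_SIGN_E, VOWEL_SIGN_O = "\u0BC6", "\u0BCA"

# Hypothetical glyph table: character sequence -> glyph identifiers in display order.
GLYPH_TABLE = {
    (KA,): ["ka"],
    (KA, VOWEL_SIGN_I): ["ki_ligature"],              # many-to-one
    (KA, VOWEL_SIGN_E): ["e_sign", "ka"],             # sign drawn before the consonant
    (KA, VOWEL_SIGN_O): ["e_sign", "ka", "aa_sign"],  # two-part sign around the consonant
}
MAX_SEQUENCE = max(len(key) for key in GLYPH_TABLE)

def select_glyphs(text):
    """Map stored characters to glyph identifiers, trying the longest sequence first."""
    glyphs, i = [], 0
    while i < len(text):
        for size in range(min(MAX_SEQUENCE, len(text) - i), 0, -1):
            key = tuple(text[i:i + size])
            if key in GLYPH_TABLE:
                glyphs.extend(GLYPH_TABLE[key])
                i += size
                break
        else:
            glyphs.append(".notdef")  # no glyph known for this character
            i += 1
    return glyphs

print(select_glyphs(KA + VOWEL_SIGN_E))
print(select_glyphs(KA + KA + VOWEL_SIGN_O))
```

The stored order is always consonant first, while the display order can differ, which is why deletion and cursor movement need the reverse mapping from glyphs back to characters.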
Using Tamil on shrink-wrapped software
Given the complexities of implementing Tamil script with Unicode, it will not be possible to use Unicode based Tamil text in off-the-shelf applications unless Tamil text handling capability is built into them. Even internationalised versions of these applications are written for specific environments and as such may not implement all the languages defined in Unicode. (An application is not required to implement all the code sets in order to be Unicode compliant.)
In addition, the operating system should provide support in the input method as well as at the file system level in order to have applications that are Unicode compliant. Most major operating system vendors have announced support for Unicode and some have implemented it. But not all of them are expected to implement support for Tamil.
Developing "Tamil Aware" Applications with Unicode
It is possible to develop new applications for Tamil based on Unicode today. However, attention should be given to all text handling functions so that they do not break the rules for 16-bit characters (also known as wide characters). One simple example is not to assume that a byte is always equal to a character, and not to perform pointer arithmetic as is done with 8-bit characters. The advantage of developing on Unicode is that the application can easily be moved to other environments.
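The byte versus character pitfall can be shown with a short sketch. Python's `utf-16-be` codec stands in for the 16-bit wide-character form here (an assumption for the example; every character in the Tamil block fits in one 16-bit unit), and both helper functions are hypothetical.

```python
text = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"  # five Tamil characters
raw = text.encode("utf-16-be")            # two bytes per character here

def wide_char_at(buf: bytes, index: int) -> str:
    """Step in whole 16-bit units: character i starts at byte offset 2 * i."""
    return buf[2 * index: 2 * index + 2].decode("utf-16-be")

def byte_step_at(buf: bytes, index: int) -> str:
    """Buggy 8-bit habit: treats the character index as a byte offset."""
    return buf[index: index + 2].decode("utf-16-be")

print(len(text), "characters,", len(raw), "bytes")
print(wide_char_at(raw, 1) == text[1])  # correct character
print(byte_step_at(raw, 1) == text[1])  # halves of two different characters
```

The buggy version does not fail loudly; it silently returns an unrelated character built from the second byte of one character and the first byte of the next.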
Non-graphical devices
From the sections above, it can be seen that Unicode requires a complex character to glyph substitution process. As such, it is better suited to sophisticated high-end desktops that have graphical displays and the capability of storing and rendering fonts from disk or memory.
Character rendering on non-graphical terminals is usually done through firmware (i.e. implemented in hardware). These devices provide only very limited banks of memory to load and render user defined fonts. Currently available devices are designed for simple text-entry and point of sale applications. Most of these just implement plain ASCII and fill up the extended space with line drawing characters.
Although graphical displays are taking over from character terminals in most areas, the use of character terminals cannot be discarded in this part of the world, especially in high volume areas where graphics devices cost substantially more than character terminals.
As such a one-to-one (character to glyph) encoding is still required for Tamil. Though there are a few available, these efforts need to be synchronised and the standard made completely open for anyone to use without any kind of legal binding.
Conclusion
Unicode is where the future is. Although there are not many environments that are Unicode ready today, there will be within the next few revisions. There will also be environments that will not move to Unicode. To serve both kinds of environment, we need to define a single-byte character set for Tamil that fits into the most commonly used 256 code space while we develop software for the future. This will ensure that off-the-shelf shrink-wrapped software can continue to be used, only in a more standardised and consistent manner.
Speakers'/Authors' Profile
Name : Muthu Nedumaran Email : [email protected]
Occupation : Technical Marketing Manager, SunSoft, Sun Microsystems Asia South region.
Tamil Computing Experience : Over 11 years. Developed the first Tamil interface on a PC with hardware modifications and assembler level device drivers in 1986. Presented a working system at the 6th International Conference on Tamil Studies in K.L., 1987. Developed the MURASU range of Tamil desktop publishing interfaces for Windows and Unix. Developed MURASU Anjal in Jan 1995 - the most widely used Tamil email interface (over 35,000 users to date). Co-founded Tamil.Net - the largest Internet mailing list with Tamil email exchange (over 300 subscribers). MURASU is used by almost all Tamil publications published in Malaysia and Singapore.
IT Experience : Over 12 years in IT. Currently with Sun Microsystems as regional technical marketing manager. Responsible for Internet/Intranet related software technologies and solutions. Awarded International Systems Engineer of the Year for 1996 by Sun Microsystems Inc. in Sydney, Australia.
Other Activities : Life member, International Association for Tamil Research, Malaysia.