Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

A Unicode Question: Character Decompositions

Written by Patrick Hall, 7 months, 3 weeks ago.

I fancy myself something of a Unicode fanatic, but I don’t pretend to understand all, or even most, of the specs on the topic. I’m very much a learn-as-I-go kind of guy, which I think is an okay way to learn Unicode stuff, since I pretty much deal with it every day.

End of preamble, beginning of post-preamble:

Some letters can be automatically broken down (“decomposed,” I think, is the right term) into more characters, some of which don’t normally stand on their own.

For instance, here’s a Thai “letter”:

ด้

It’s actually a consonant plus a tone mark, and it’s possible to rip those two parts out and look at them:

U+0E14: THAI CHARACTER DO DEK (ด)

U+0E49: THAI CHARACTER MAI THO (◌้)

As you can see (well, as you can see if you have a pretty complete Thai font), there is a “letter” called Do Dek, and another called Mai Tho. Mai Tho is the tone mark, and it’s attached to Do Dek. As a loose analogy for Roman alphabet fans, it’s as if an i with a dot and an i without a dot were distinct letters.
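For anyone who wants to poke at this themselves, here is a minimal sketch in Python using the standard `unicodedata` module. The helper name `describe` is made up for illustration. One thing it suggests is that the Thai example is already stored as two separate code points, so listing them is just iteration over the string rather than a decomposition step in the strict Unicode sense.

```python
import unicodedata

# The Thai example from above, written as escapes so it survives any font.
text = "\u0e14\u0e49"

def describe(s):
    """Hypothetical helper: list (code point, name, general category) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in s
    ]

for code_point, name, category in describe(text):
    # Category "Mn" marks a nonspacing mark, i.e. something that attaches
    # to the preceding base character instead of standing on its own.
    print(code_point, name, category)
```

The general category is what tells you which of the two characters is the attached part.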

Come to think of it, they are, in Turkish: U+0131: LATIN SMALL LETTER DOTLESS I (ı).

But anyway, the point is that sometimes, from a linguistic viewpoint, you want to do this ripping apart. For an automatic transliteration project I’ve been working on (about which more later), it will be useful to be able to access this kind of info for Thai; it sort of turns an abugida into an alphabet.

However, it doesn’t seem to be the case that such decompositions are universal in Unicode land. The specific case I have in mind is Amharic, which is also an abugida (that’s the language that the word comes from, as a matter of fact), for which there appears to be no decomposition.

That is to say, there is no way to decompose the characters:

U+1200: ETHIOPIC SYLLABLE HA (ሀ)

U+1201: ETHIOPIC SYLLABLE HU (ሁ)

…in such a way that we can “get at the vowel parts” as independent characters, and see that they are both variants, in some sense, of the “h part.”
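A rough way to check this claim is sketched below in Python with the standard `unicodedata` module. It compares a Latin letter that does have a canonical decomposition (é) with the two Ethiopic syllables. The second helper, `ethiopic_row_and_order`, is purely hypothetical: it leans on an assumption read off the code chart layout, namely that in the basic Ethiopic range each consonant row starts at a multiple of 8, and it is not a property that Unicode itself defines.

```python
import unicodedata

def canonical_pieces(ch):
    """Code points that canonical decomposition (NFD) yields for one character."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

def ethiopic_row_and_order(ch):
    """Hypothetical helper, NOT a Unicode property.

    ASSUMPTION: within U+1200..U+1357 the code chart is laid out in rows of 8,
    so clearing the low three bits gives the first cell of the row and the low
    three bits give the position (the vowel "order") inside it. Some rows have
    unassigned cells, so treat the result as a hint only.
    """
    cp = ord(ch)
    if not 0x1200 <= cp <= 0x1357:
        raise ValueError("outside the basic Ethiopic syllable range")
    return chr(cp & ~0x7), cp & 0x7

print(canonical_pieces("\u00e9"))            # e with acute: splits in two
print(canonical_pieces("\u1200"))            # ETHIOPIC SYLLABLE HA
print(canonical_pieces("\u1201"))            # ETHIOPIC SYLLABLE HU
print(repr(unicodedata.decomposition("\u1201")))
print(ethiopic_row_and_order("\u1201"))
```

If the layout assumption holds, the arithmetic at least recovers that HA and HU sit in the same row, even though no decomposition mapping says so.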

So:

  1. Am I wrong about Amharic?
  2. Is this sort of thing purely script-specific in Unicode, or is there a general policy that says “decompositions should be available if possible”?
  3. If the answer to #2 is “there is no general policy,” is there at least a list somewhere that will tell me which writing systems do and do not have such decompositions?
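
In the absence of a ready-made list for question 3, one could build a rough one. The sketch below (Python, standard `unicodedata` module) walks every code point, keeps the characters that canonical decomposition (NFD) actually changes, and tallies them by the first word of the character name. That first word is only a crude stand-in for the script, since `unicodedata` does not expose the Script property, and the function name is invented for this example. The result also depends on the Unicode version bundled with your Python.

```python
import unicodedata
from collections import Counter

def name_prefixes_with_canonical_decompositions():
    """Tally characters that NFD changes, keyed by the first word of their name.

    ASSUMPTION: the first word of the character name ("LATIN", "GREEK", ...)
    is used as a rough proxy for the script. It is not the real Script property.
    """
    counts = Counter()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if unicodedata.normalize("NFD", ch) == ch:
            continue  # unchanged by NFD: no canonical decomposition
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts

counts = name_prefixes_with_canonical_decompositions()
for prefix, n in counts.most_common(10):
    print(prefix, n)
print("ETHIOPIC" in counts, "THAI" in counts)
```

Note that this only sees canonical decompositions. A script like Thai, where the marks are typed and stored as separate characters in the first place, will not show up here either, which hints that “has decompositions” and “has separately encoded parts” are two different questions.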

5 Comments for 'A Unicode Question: Character Decompositions'

  1. Comment received 5 months, 3 weeks ago from Jim Allan

    You can “get at” the vowels only for scripts where the vowels are encoded as separate characters. In the Ethiopian script, the vowel and consonant parts of the characters don’t appear independently as part of words in the languages written in that script, and therefore there is no reason for them to be coded separately in Unicode. Using parts of these composed letters isn’t done. The supposed “letters” you want to use don’t exist in any Ethiopian writing system, so far as I know. So it is no defect that they don’t exist in the Unicode representation of Ethiopian, which effectively treats the script as though it were an alphabet; that saves money on creating fonts that use the characters.

    About all you can do is use a font editor to create your own part-letters in the private characters section of Unicode within a font. You can share any document you create with such characters on the web using PDF.

    If you can find use of these part characters within books in Amharic, linguistic books or teaching books perhaps, then you probably can use that as documentation to persuade the Unicode consortium to encode these part characters, because they have indeed been used, though it might take two years or longer to get such characters through the various bodies that must approve them.

    I don’t know of any list of scripts where you can separate vowels, but I don’t think that is needed. Just browse the charts at http://www.unicode.org/charts/ for any script that interests you and you can easily see whether vowels are listed separately from consonants.

  2. Comment received 5 months, 2 weeks ago from 28481k

    @Jim Allan: Indeed, Amharic/Ge’ez treats its script as a syllabary rather than an abugida or an abjad, so the vowel addition (which may not be at the same place anyway) is an integral part of each character. There is simply no good reason to discern them.

    However, I’ve also heard that in order to achieve texting in Amharic, Nokia has to tweak its internal coding such that vowels are separated from consonants in the composition stage until the syllable is produced. For example, if you want the syllable HI, you first press the button for HA, then you type another keystroke for I, and then HI is produced. I wonder whether such intermediate input stages should be reflected in the encoding, though.

    Besides, Amharic is not the only script that “suffers” from this problem: the graphically designed syllabaries for Native Canadian languages also have no decomposition into a base character. For example, ᐊ is A, ᐃ is I and ᐅ is O. These three are separate encodings without any decomposition as well…

  3. Comment received 5 months, 2 weeks ago from Jim Allan

    At one time Unicode was going to implement typewriter Ethiopian, until they learned better. You will find a lot of discussion about this on the web, including claims that those who favored typewriter Ethiopian were traitors attempting to kill the Ethiopian script and that typewriter Ethiopian is not Ethiopian at all, the last obviously nonsense. It was a reasonable form of the Ethiopian script if you were limited to a typewriter keyboard.

    Canadian syllabics don’t have vocalic decoding characters, since directionality of the character indicates the associated vowel sound. How do you show directionality on the screen? I suppose particular keyboard entry software could use the arrow characters if the characters were to be entered in two key strokes, and the ^ sign for the superscript character in the set.

    But, if you can already insert the full characters into a text stream via a keyboard, I don’t see that any script is “suffering”. Ideographic scripts and syllabaries (including structure syllabaries) don’t have separate vowel characters, at least not separate vowel characters identical to portions of the syllabic characters. If Babylonian Akkadian has got along without them for 3,000 years, it probably doesn’t need artificial vowel characters now.

    The Unicode people are *very* good about encoding any characters that can be shown to be used in a plain text writing or printing environment. But I don’t see them encoding any characters not found in any texts, save perhaps in pictorial graphic form very rarely. They would probably say that if you can’t find that the semi-characters you want have ever been used in text, it’s probably because no-one ever needed them or wanted them. They get somewhat fed up with various cranks with new characters that they are sure mathematicians or others would jump to use if they were available. If indeed Nokia and others are using keyboard input now, without any extra Unicode characters, that would seem to confirm that these characters are not needed. Half-characters that may flash on the screen during input but never appear in text are not what Unicode is about. If such part characters are desired only for input and not for text, then they should be provided by the input routine. You imply that is being done now, with no difficulty, and there are various Amharic keyboards easily obtainable for PCs.

    But I’m not the one you have to convince, which you have failed to do. It is the Unicode people. You can either make a submission directly or talk it over with them first on the Unicode email list which you can subscribe to at http://www.unicode.org/consortium/distlist.html . But without samples of the characters you want being used somewhere, you have, I believe, no chance whatsoever of them being accepted.

  4. Comment received 5 months, 2 weeks ago from 28481k

    Jim Allan:

    I might have sounded as if I were complaining in my last reply, but I actually believe that the current solution for Ethiopic and Unified Canadian Syllabics is probably the best way to encode them, because that’s what the users wanted (I didn’t know there was a Typewriter Ethiopic encoding model!) and it could create many troubles in displaying them correctly if they were not individually encoded.

    I think what Nokia would do is to have an intermediate typewriter-like composing stage for texting, as a telephone keypad is too small to even accommodate all header characters (those ending in -A). Of course, those “half” characters may not have to be displayed anyway, because what people want is a complete syllable.

    On Canadian Syllabics, short of using Variation Selectors there is no good way to “decompose” a character, as Canadian Syllabics is graphically based. That alone bars a possible “composing” model, unless there is an efficient way to demand full functionality of variation selectors for contextual shaping in every display engine.

    The difficulty in determining and squeezing basic components of a Han character is also why each Han character is encoded separately (with unification) in Unicode or indeed any other Chinese computer coding. Chinese characters alone used about 75% of the currently allotted Unicode code points, so I shouldn’t complain that Ethiopic or Unified Canadian Syllabics uses mere hundreds of code points. :P I don’t have much beef about Han Unification like the Japanese do either: because simplified Chinese characters simply couldn’t in any graphical sense be unified with traditional characters, the effect of national glyphs being unified is less than likely. Of course, should one really want to force a certain national variant in all contexts, I’d suggest using a Variation Selector as a key to achieve that. Too bad the current variation selector proposal made by Adobe Japan is less than ideal: they simply encode tons of minute typographical differences to the Unicode repertoire with some pseudo-encoding to bypass the IRG (luckily, some of the more glaring ones were caught by the IRG and got fast-tracked into CJK BMP space).

  5. Comment received 3 months, 3 weeks ago from Jim Allan

    A recent comment by Ken Whistler of the Unicode Consortium:

    “It is a little like asking what is the Unicode character that should be used for the various dots, circles, lines, and other marks that occur in various positions around Canadian aboriginal syllables, U+1400..U+1676. They simply aren’t analyzed as separate characters. So if you were trying to represent text which was essentially metatext talking *about* Canadian aboriginal syllables, and which had little glyph parts on boxes to represent the various dots and such with respect to the core graphs of the syllables, you wouldn’t have (or really need) characters for those, either.

    The ALA-LC romanization tables are metatext about writing systems, and in some instances, as for the waslah, they are talking about *parts* of the glyphs for characters, and not characters per se.”

    I think “if you were trying” in this post is a typing error for “unless you were trying”.

    This text is found at http://www.unicode.org/mail-arch/unicode-ml/y2008-m03/0025.html .

    The ALA-LC romanization tables are available at http://www.loc.gov/catdir/cpso/roman.html .
