Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Poking around in the Common Locale Data Repository

Written by Patrick Hall, 2 years, 5 months ago.

I18ners and l10ners, rejoice! There’s a project brewing over at the Unicode.org site with tons of information that will be helpful to you:

Common Locale Data Repository

It contains information that is specific to particular places or cultures: for instance, you’ll find the names of languages, names for currencies, conventions for writing numbers, and so on. Here are a few interesting tables to give you an idea of the sort of info that’s buried in the CLDR:

The project is still getting off the ground (public vetting starts this month), but there is already a pretty impressive amount of information there. Here’s a big XML file with all kinds of locale info in French, here it is again in Amharic, and so on.
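
To give a feel for working with one of these files, here is a minimal Python sketch that reads language names out of a locale fragment. The XML below is a hand-written stand-in rather than a copy of the real French file, and the element layout (localeDisplayNames, then languages, then language elements with a type attribute) is an assumption about how the CLDR files are organized, so check it against the actual data before relying on it.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a CLDR locale file (assumed layout, not real data).
SAMPLE_LDML = """\
<ldml>
  <localeDisplayNames>
    <languages>
      <language type="en">anglais</language>
      <language type="fr">français</language>
      <language type="am">amharique</language>
    </languages>
  </localeDisplayNames>
</ldml>
"""


def language_names(ldml_text):
    """Map language codes to their display names in this locale."""
    root = ET.fromstring(ldml_text)
    return {
        el.get("type"): el.text
        for el in root.findall("./localeDisplayNames/languages/language")
    }


names = language_names(SAMPLE_LDML)
for code, name in sorted(names.items()):
    print(code, name)
```

The same pattern should work for other parts of a locale file by changing the path passed to findall, provided the path matches the real structure.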

This isn’t the first such effort (geonames.de comes to mind), but it’s the first site I know of where so much data is available in XML. And besides, the fact that the Unicode Consortium is behind it lends the project a lot of weight.

One feature that caught my eye was “exemplarCharacters.” As far as I can tell, this translates roughly to what people think of as an “alphabet” (although it doesn’t define digraphs or ordering). Here’s the set of exemplar characters for French:

[a à â æ b c ç d e é è ê ë f-i î ï j-o ô œ p-u ù û ü v-y ÿ z]

That set can be thought of as the characters that really should be supported by any software that claims to be able to handle French text.
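
As a rough illustration of that idea, the Python sketch below expands the French set into individual characters and reports any letters in a piece of text that fall outside it. The function names expand_set and unsupported are made up for this example, and the parser only understands single characters and x-y ranges; real CLDR sets can also contain escapes and multi-character strings, which are ignored here.

```python
FRENCH = "[a à â æ b c ç d e é è ê ë f-i î ï j-o ô œ p-u ù û ü v-y ÿ z]"


def expand_set(pattern):
    """Expand a simple pattern like "[a à b-d]" into a set of characters.

    Hypothetical helper: handles only single characters and x-y ranges.
    """
    body = pattern.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    items = [ch for ch in body if not ch.isspace()]
    chars = set()
    i = 0
    while i < len(items):
        # A character followed by "-" and another character is a range.
        if i + 2 < len(items) and items[i + 1] == "-":
            start, end = ord(items[i]), ord(items[i + 2])
            chars.update(chr(cp) for cp in range(start, end + 1))
            i += 3
        else:
            chars.add(items[i])
            i += 1
    return chars


def unsupported(text, exemplars):
    """Return the letters in text that are missing from the exemplar set."""
    # The set above is lowercase only, so compare against lowercased text.
    return {ch for ch in text.lower() if ch.isalpha() and ch not in exemplars}


french = expand_set(FRENCH)
print(unsupported("Où est le cœur de Noël ?", french))
print(unsupported("señor", french))
```

The first call should report nothing missing, while the second should flag the ñ, which is not part of the French set.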

A couple more examples… Here’s Armenian:

[Ա-Ֆՙ-՟ա-և֊ﬓ-ﬗ]

And here’s Tigrigna (I added the linebreaks):

[ሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍ
ነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕ
ጘ-ፚ፟-፼ᎀ-᎙ⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞ]

Note that these sets use the syntax of regular-expression character classes (the CLDR calls them UnicodeSets), so [a-z] means “all the characters from a to z, in the order they are found in Unicode.”
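
A quick way to see what “in the order they are found in Unicode” means is to try the same range in an ordinary regular expression. This Python sketch uses the standard re module as a stand-in; it is not the CLDR's own set implementation, just a demonstration that a range is defined by code points rather than by any language's alphabetical order.

```python
import re

ascii_letters = re.compile(r"[a-z]+")

# Ranges follow code point order: U+0061 (a) through U+007A (z).
print(ord("a"), ord("z"))      # 97 122
print(ord("\u00e9"))           # 233 (é), which lies outside the range

print(bool(ascii_letters.fullmatch("cafe")))        # True
print(bool(ascii_letters.fullmatch("caf\u00e9")))   # False

# Expanding the range by hand gives the same 26 characters.
expanded = [chr(cp) for cp in range(ord("a"), ord("z") + 1)]
print("".join(expanded))       # abcdefghijklmnopqrstuvwxyz
```

This is why the French set has to list é, è, and the rest explicitly: they sit elsewhere in Unicode, so no a-to-z range will pick them up.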

Even so, I’m pretty sure that the “smallest set award” is shared by Cornish, English, Indonesian, Malaysian, Oromo, Somali, and Swahili, which are defined as having just these characters:

[a-z]

And even that may overestimate the characters you need for some of those languages. It’s my understanding, for instance, that the Swahili alphabet requires fewer characters than that set includes.

Of course, the hands-down winner would be Rotokas, but I guess they haven’t gotten around to defining that one yet.

When they do, it will look something like this:

[a e i g k o p r s t u v]

That’s all, folks!

2 Comments for 'Poking around in the Common Locale Data Repository'

  1. Comment received 1 year, 3 months ago from Steven R. Loomis

    You can play with UnicodeSets over at my UnicodeBrowser. For example, if you type the Armenian set [Ա-Ֆՙ-՟ա-և֊ﬓ-ﬗ] into the Set: box on the top right and hit enter, you can click any character for more info.

  2. Comment received 1 year, 3 months ago from Steven R. Loomis

    Unicode Announces Start of Submission Period for Common Locale Data Repository, Version 1.5
