Poking around in the Common Locale Data Repository
I18ners and l10ners rejoice: there’s a project brewing over at the Unicode.org site with tons of information that will be helpful to you, the Common Locale Data Repository (CLDR).
It contains information that is specific to particular places or cultures. For instance, you’ll find the names of languages, names for currencies, conventions for writing numbers, stuff like that. Here are a few interesting tables to give you an idea of the sort of info that’s buried in the CLDR:
- Territory → Currency
- Language → Territories
- Territory → Language
- Language → Scripts
- Script → Language
The project is still getting off the ground (public vetting starts this month), but there is already a pretty impressive amount of information there: here’s a big XML file with all kinds of locale info in French, here it is again in Amharic, and so on.
This isn’t the first such effort (geonames.de comes to mind), but it’s the first site I know of where so much data is available in XML. And besides, the fact that the Unicode Consortium is behind it lends the project a lot of weight.
One feature that caught my eye was “exemplarCharacters.” As far as I can tell, this translates roughly to what people think of as an “alphabet” (although it doesn’t define digraphs or ordering). Here’s the set of exemplar characters for French:
[a à â æ b c ç d e é è ê ë f-i î ï j-o ô œ p-u ù û ü v-y ÿ z]
That set can be thought of as the characters that really should be supported by any software that claims to be able to handle French text.
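To make that idea concrete, here is a minimal Python sketch of a coverage check. It assumes the simple bracket syntax shown above (no escapes, no nested sets), and the helper names `exemplar_regex` and `uncovered_letters` are made up for illustration; they are not part of CLDR or any library.

```python
import re
import unicodedata

# French exemplar set, copied from the post above.
FRENCH = "[a à â æ b c ç d e é è ê ë f-i î ï j-o ô œ p-u ù û ü v-y ÿ z]"

def exemplar_regex(pattern):
    """Turn a simple exemplar set into a compiled regex character class.

    Assumption: spaces and line breaks are only separators, and the set
    contains no escapes, nested sets, or characters special to `re`.
    """
    pattern = unicodedata.normalize("NFC", pattern)
    body = "".join(pattern.strip()[1:-1].split())
    return re.compile("[" + body + "]")

def uncovered_letters(text, pattern):
    """Return the letters in `text` that fall outside the exemplar set."""
    matcher = exemplar_regex(pattern)
    missing = []
    # Normalize so accented letters are single precomposed characters.
    for ch in unicodedata.normalize("NFC", text).lower():
        if ch.isalpha() and not matcher.fullmatch(ch) and ch not in missing:
            missing.append(ch)
    return missing
```

For example, `uncovered_letters("Où est la bibliothèque ?", FRENCH)` returns an empty list, while `uncovered_letters("Straße", FRENCH)` returns `["ß"]`, since ß is not in the French set.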
A couple more examples… Here’s Armenian:
[Ա-Ֆՙ-՟ա-և֊ﬓ-ﬗ]
And here’s Tigrigna (I added the linebreaks):
[ሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍ
ነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕ
ጘ-ፚ፟-፼ᎀ-᎙ⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞ]
Note that these things use regular-expression character-class syntax (CLDR calls them UnicodeSets), so [a-z] actually means “all the characters from a to z, in the order they are found in Unicode.”
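The range rule can be sketched in a few lines of Python. This is only an illustration under the assumption that a set uses nothing beyond single characters and `x-y` ranges; the function name `expand_set` is hypothetical, and real CLDR sets can use more syntax than this handles.

```python
def expand_set(pattern):
    """Expand a simple exemplar set into a list of individual characters.

    Assumption: the set contains only single characters and x-y ranges,
    with whitespace used purely as a separator.
    """
    # Drop the surrounding brackets and any spaces or line breaks.
    body = "".join(pattern.strip()[1:-1].split())
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range covers every code point from its start to its end.
            start, end = ord(body[i]), ord(body[i + 2])
            chars.extend(chr(cp) for cp in range(start, end + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars
```

With this, `expand_set("[f-i]")` gives `["f", "g", "h", "i"]`, and `len(expand_set("[a-z]"))` is 26.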
I’m pretty sure that the “smallest set award” is shared by Cornish, English, Indonesian, Malay, Oromo, Somali, and Swahili, which are defined as having just these characters:
[a-z]
And even that may be overestimating the characters you need for some of those languages—it’s my understanding, for instance, that the Swahili alphabet requires fewer characters than that set includes.
Of course, the hands-down winner would be Rotokas, but I guess they haven’t gotten around to defining that one yet.
When they do, it will look something like this:
[a e i g k o p r s t u v]
That’s all folks!
2 comments.
You can play with UnicodeSets over at my UnicodeBrowser. For example, if you type the Armenian set [Ա-Ֆՙ-՟ա-և֊ﬓ-ﬗ] into the Set: box at the top right and hit Enter, you can click any character for more info.
Unicode Announces Start of Submission Period for Common Locale Data Repository, Version 1.5