ASCII-only Languages
This is a perverse topic given this blog’s positively fanatical obsession with Unicode, but it’s fun nonetheless:
What languages have orthographies that can be written entirely in ASCII?
After a bit of messy experimenting, I have discovered that the answer seems to be “quite a few,” at least dozens, probably.
Anyway, here’s my grubby .txt file if you’d like to take a look:
I generated the list like this: I went through all the documents in the Universal Declaration of Human Rights, and grabbed the set of all the letters in the document, assuming that was at least a reasonable approximation of the alphabet of each language. Then I checked to see if it was a subset of ASCII. That is all.
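For the curious, here is a minimal sketch of that check in Python. It is not the original script: the function names and the `documents` mapping (language name to document text) are made up for illustration, and it assumes that every non-whitespace character in a document counts toward its "alphabet," punctuation included.

```python
def charset(text: str) -> set[str]:
    """Every distinct non-whitespace character in the text."""
    return {ch for ch in text if not ch.isspace()}


def is_ascii_only(text: str) -> bool:
    """True if the text's character set is a subset of ASCII (U+0000 to U+007F)."""
    return all(ord(ch) < 128 for ch in charset(text))


def ascii_only_languages(documents: dict[str, str]) -> list[str]:
    """Sorted names of the languages whose sample text uses only ASCII."""
    return sorted(name for name, text in documents.items() if is_ascii_only(text))
```

Loading the actual UDHR files into `documents` is left out; any mapping of names to strings will do.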
(And I was disappointed to discover that Marshallese, although it does appear, no longer seems to use the ampersand sign as a vowel. Alas.)
I’d be interested in hearing about errors, being pelted with angry orthographical rants, etc, etc.
12 comments.
What about English?
Good question Simon — I think there’s an error in my script. Sit tight, I’m trying to figure out what happened. Perhaps there are even more languages that belong on the list…
**sound of monkeys banging on keyboards**
Okay, I found it. There was a non-ASCII hyphen in the English text: U+2010.
I rebuilt the list, and English is now included, along with Swati, Xhosa, Northern Zhuang, and Zulu, which had the same non-ASCII hyphen that English had.
This shows a couple things: firstly, it’s not so wise of me to try to define an “alphabet” in terms of a single file, even one of significant length; and secondly, ASCII is a perfectly horrible encoding to use for prose in just about any language, even for languages with relatively simple “alphabets,” such as English or Zulu.
Thanks for your comment!
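One way to keep a stray hyphen from disqualifying a language is to look only at letters and combining marks, using Unicode general categories. This is a sketch of that idea rather than what the rebuilt list actually did; the function names are hypothetical, and it relies only on Python's standard `unicodedata` module.

```python
import unicodedata


def letters(text: str) -> set[str]:
    """Distinct characters whose general category is a letter (L*) or mark (M*)."""
    return {ch for ch in text if unicodedata.category(ch)[0] in "LM"}


def has_ascii_alphabet(text: str) -> bool:
    """True if every letter or mark in the text is ASCII; punctuation is ignored."""
    return all(ord(ch) < 128 for ch in letters(text))


sample = "well\u2010being"  # contains U+2010 HYPHEN, not the ASCII hyphen-minus
print(unicodedata.category("\u2010"))  # Pd (dash punctuation), so it is skipped
print(has_ascii_alphabet(sample))      # True
```

Marks are included so that a diacritic stored as a separate combining character still counts against a language. The trade-off is that this version says nothing about punctuation, so it answers a slightly different question than a check over every character.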
It’s also worth pointing out, by the way, that I’ve found several instances where the orthography used in the Universal Declaration of Human Rights is only one of several possible orthographies—Ilokano, for instance, apparently has an older (or competing?) orthography which contains several Spanish-esque accent marks.
(Random aside: I was interested to discover that the orthography in the Basque document doesn’t use ñ. In the discussion page of the Basque language Wikipedia article there is some discussion of various orthographies, but I’ve been unable to determine if the ñ is a part of “Standard Basque” (euskara batua) or not. Anybody know?)
And if you take into account punctuation used for each language, how many ASCII only languages remain?
Hi Andj,
Hmm, not sure I understand. Presumably, any language that uses non-ASCII punctuation wouldn’t have made the current list.
Hawaiian orthography, for instance, requires the use of a:
U+02BB MODIFIER LETTER TURNED COMMA
So Hawaiian’s out. (Actually Hawaiian’s also disqualified because it uses macrons over vowels, but the point remains.)
U+02BB is called okina in Hawaiian, incidentally. Since that symbol can cause some keyboarding woes under some operating systems, it’s sometimes written with a normal ASCII apostrophe:
U+0027 APOSTROPHE
If Hawaiian were to be written with that for okina and without macrons, it would make the list.
Is that the sort of distinction you had in mind?
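To see exactly which characters knock a text off the list, something like the following would do. It is only an illustration: `non_ascii_report` is a made-up helper, and the sample string is a short Hawaiian phrase typed in by hand, not taken from the UDHR text.

```python
import unicodedata


def non_ascii_report(text: str) -> list[str]:
    """List each distinct non-ASCII character as 'U+XXXX NAME', sorted by code point."""
    offenders = sorted({ch for ch in text if ord(ch) > 127})
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in offenders]


# A phrase written with the okina (U+02BB) and a macron vowel (U+014D).
print(non_ascii_report("\u02bb\u014dlelo Hawai\u02bbi"))
# ['U+014D LATIN SMALL LETTER O WITH MACRON', 'U+02BB MODIFIER LETTER TURNED COMMA']

# The same phrase with an ASCII apostrophe and no macron reports nothing.
print(non_ascii_report("'olelo Hawai'i"))
# []
```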
Anyway,
Yet another random aside: check out this documentation of ampersand as a vowel in Marshallese.
Hah! Hawt.
Isn’t ‘Iruña/Iruñea’ a standard Basque spelling?
By the way, it’s sometimes wrong to assume the used characters are the fixed alphabet of those languages.
Many minority languages don’t have fixed orthographies or have some kind of academic orthographies which are not widely used. Many of those languages have localization/internationalization issues with technology, i.e., it’s impossible to type the “proper” orthography on a regular computer.
For example, tonal languages like Bemba, Koongo, or Luba sometimes have orthographies using non-ASCII characters. Doesn’t Malagasy use the n with umlaut?
Heya Denis,
Yeah, I’m not sure about Basque… I chatted with Languagehat about it and he was surprised as well — he said the resources in his sizeable library all included ñ. I think I’ll drop a line to Luistxo and get the skinny on current trends.
You’re absolutely right to point out that the assumption that all these languages have a fixed orthography is a shaky one. It would be interesting to go through the list and see which of the included ones don’t have an official or significant usage. There’s also the fact that the UDHR has only been translated into certain languages, and only a subset of those have made it onto Eric Muller’s UDHR in Unicode site, my source. So yep, neither comprehensive, nor authoritative, but interesting nonetheless. ☺
Interesting about Malagasy, I’d not heard of that. Consulting the oracle:
Come to think of it, another interesting experiment would be to try to fish out which letters or letter/diacritic combinations are very rare or unique to a single language in the collection…
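A rough sketch of how that experiment might go, assuming the same kind of hypothetical `documents` mapping from language name to text. The function name and the `max_languages` threshold are invented for illustration.

```python
from collections import defaultdict


def rare_characters(documents: dict[str, str], max_languages: int = 1) -> dict[str, list[str]]:
    """Map each character to the languages using it, keeping only the rare ones."""
    users = defaultdict(set)  # character -> set of language names
    for language, text in documents.items():
        for ch in set(text):
            if not ch.isspace():
                users[ch].add(language)
    return {
        ch: sorted(langs)
        for ch, langs in users.items()
        if len(langs) <= max_languages
    }
```

With `max_languages=1` this returns only characters unique to a single document; raising it gives "very rare" instead. It treats precomposed and decomposed forms of the same letter as different characters, so normalizing the texts first would probably be wise.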
In the spirit of digression, I wanted to point out a cool ASCII hackthography: Uzbek Latin orthography actually utilizes the ASCII apostrophe as a combining character, doing the work of a diacritic on several vowels and at least one consonant. Punctuation, reloaded.
Ñ is part of the standard Basque alphabet. Seldom used, but it is used, certainly.
The ampersand in Marshallese was never, as far as I know, part of the orthography used by the Marshallese people. The ampersand was part of Bender’s quite minimalist phonemic guide. However, Katzner put the phonemic system in “Languages of the World” instead of one of the practical orthographies, hence the legend that Marshallese uses &. In fact, almost all traditional orthographies use an umlaut/diæresis or some other diacritic, and the current official orthography has lots of diacritic marks, and no &.