Hmm.
Now here is a weblog which definitely stands to benefit from Blogamundo.
One could argue that the verbs I’ve highlighted here in the Bloglines interface are perfectly clear:

If you click it, then the posts beneath that little header bar will change to “oldest first” order. Right? Maybe I’m just dim, but I find myself having to think that through every day.
Now compare that to a typical slideshow on Flickr:

You could argue that that is redundant. But I don’t think so. It’s not redundant if it takes you from “having to think out the logic in your head” to “state of no doubt whatsoever.”
Would it make sense for Bloglines to have a big redundant header like this:
Oldest first (sort newest first)
Personally, I think it would–that tiny change would make me a much happier user.
We’re going to start with an English interface to Blogamundo. Brazilian Portuguese will be next, because Jonas and I both speak it. (He’s Brazilian; I’m just a Brazil nut.)
Beyond that, we’re going to try to enlist the help of our users for localization.
There are tons of technical names for all this in linguistics: tense, aspect, and mood are a few. (I think there’s “perfectivity” as well. I’ll have to look that up.) But these sorts of distinctions are formed differently in different languages; in some languages, perhaps, a single verb would be immediately clear.
That kind of info has to be supplied by native or near-native speakers.
6 comments.
Technorati tags: Language and the Web, ui, verbs
Translator Margaret Marks pointed to an article on forensic linguistics. (Here’s a very brief intro at Wikipedia.) Forensic linguistics is the use of linguistic expertise in the law.
This sort of thing has a fairly long history.
It also overlaps pretty closely with what’s called stylometry, the quantitative analysis of the “style” of spoken or written language. Surprisingly, stylometry goes right back to 1440 and the forged Donation of Constantine:
The Italian humanist Lorenzo Valla proved in 1440 that the Donation must be a fake by analyzing its language, and showing that while certain imperial-era formulas are used in the text, some of the Latin in the document could not have been written in the fourth century.
(Speaking of Latin, “nihil novum sub sole” comes to mind.)
There is a very modern consequence of this sort of technology that I’ve never seen discussed. That consequence is the potential use of stylometric techniques to identify the authors of anonymous blogs. It seems to me that sooner or later this is going to happen.
Anonymity in blogging is pretty important these days, and I think it’s important that people who need to blog anonymously understand this simple fact: anonymity networks and encryption aren’t enough to keep your identity hidden. In the long run, at least, the only way to really ensure your anonymity is to never, ever associate any text you’ve written with your real name.
The scenario would go like this:
It’s a very sticky problem. And the stickiest part is that there’s really no way to completely disguise the way you write. When we use language, we aren’t really even conscious of our style. And even little typographical details may serve to help identify an author—do you use em-dashes?
Stuff like that could put ya in the slammer just like a fingerprint could. From the article:
Among other textual similarities, Mr. Fitzgerald found both the anonymous letters and the doctor’s own writing samples contained similar and unusual spacing between words.
Spacing. Who thinks about that? It’s just habit.
Mathematically, stylometry (or authorship attribution as it’s also known) is really interesting. You can do neat stuff in literature, such as make the case that Shakespeare copped some of his stuff from Marlowe.
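Here’s a toy version of that math, just to show the flavor. It’s a minimal sketch and not how real forensic linguists work: the feature list (a handful of function words, plus the em-dash and spacing habits mentioned above) and the helper names style_profile and cosine_similarity are all my own assumptions. Each text becomes a short vector of habits, and two vectors are compared by the angle between them.

```ruby
# encoding: utf-8

# Assumed feature set: a few very common function words. Real stylometry
# uses far richer features than this.
FUNCTION_WORDS = %w[the of and to a in that it]

# Hypothetical helper: turn a text into a vector of per-word habit rates.
def style_profile(text)
  words = text.downcase.scan(/[a-z']+/)
  total = [words.length, 1].max.to_f                  # avoid dividing by zero
  profile = FUNCTION_WORDS.map { |word| words.count(word) / total }
  profile << text.count("\u2014") / total             # em-dashes per word
  profile << text.scan(/ {2,}/).length / total        # runs of 2+ spaces per word
  profile
end

# Cosine similarity between two profiles: close to 1.0 means similar habits.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |y| y * y })
  norm.zero? ? 0.0 : dot / norm
end

anonymous = style_profile("It is the habit of the hand\u2014and of the eye\u2014that gives it away.")
known     = style_profile("The habit of a lifetime is hard to hide, and that is the point.")
puts cosine_similarity(anonymous, known)
```

A score on its own proves nothing, of course; the point is only that habits you never think about turn into numbers very easily.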
But in terms of privacy and blogging, it’s a little creepy.
Something to think about.
2 comments.
Technorati tags: anonymity, blogs, forensic linguistics
2 comments.
Technorati tags: Fun, Language and the Web, tiếng Việt, unicode, vietnamese
Katy Pearce mailed to tell me that Unicode hasn’t really caught on yet in Armenia:
…no one in Armenia likes Unicode because they are all used to using NLS Armenian – the government sponsored system. As long as everyone uses that, they can communicate with one another. … Do you know of any sites that have a simple explanation why Unicode is better?
Yeah, this website! ☺
It’s funny, I’ve seen a lot of tutorials and so on about Unicode, but most of them are written with the programmer in mind—they start off with discussion of bytes and stuff like that.
For non-programmers, there is really only one key point to keep in mind about Unicode, and this is it:

That’s a page from Wikipedia with English, Armenian, and Russian spellings in a single file, in this case a web page.
That file pretty much couldn’t exist without Unicode. No Unicode, no Wikipedia.
That’s the main reason why Unicode is better, and why it’s worth the effort of standardizing on it.
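For the programmers in the audience, here’s a small Ruby sketch of the same point. The sample string and the helper names scripts_in and encodable? are my own inventions, and the legacy encodings are just ones Ruby happens to ship with (NLS Armenian isn’t among them, as far as I know). The string mixes three scripts, and only a Unicode encoding can hold all of it at once.

```ruby
# encoding: utf-8

# One city name in English, Armenian, and Russian (sample data for illustration).
TITLE = "Yerevan / Երևան / Ереван"

# Unicode script properties, as supported by Ruby's regular expressions.
SCRIPTS = {
  "Latin"    => /\p{Latin}/,
  "Armenian" => /\p{Armenian}/,
  "Cyrillic" => /\p{Cyrillic}/
}

# Hypothetical helper: which of the scripts above appear in the text?
def scripts_in(text)
  SCRIPTS.select { |_name, pattern| text =~ pattern }.keys
end

# Hypothetical helper: can the text be stored in the given encoding at all?
def encodable?(text, encoding)
  text.encode(encoding)
  true
rescue EncodingError
  false
end

puts scripts_in(TITLE).join(", ")
%w[ISO-8859-1 Windows-1251 UTF-8].each do |encoding|
  puts "#{encoding}: #{encodable?(TITLE, encoding)}"
end
```

Windows-1251 is perfectly happy with the Russian on its own; it’s the combination that no single legacy encoding can handle.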
Now, back to the specific example of Armenia, let’s consider an all-too-real scenario where Unicode could make a difference.
Imagine a state health worker checking her email in Yerevan. A Russian-speaking doctor in a clinic in Azerbaijan has mailed her. The health worker doesn’t happen to know much Russian, but she knows enough to recognize the phrase “bird flu,” and immediately forwards it to someone who can translate it into Armenian. Then the report can be forwarded wherever it needs to be.
I just made that scenario up, but look at all the scripts in varying degrees of use around the Caucasus:

Any combination of those languages could end up needing to be translated for some urgent reason or another. And in such a situation you want to be able to think of text as text, not as data.
That’s what Unicode does. It puts everyone on the same page (or screen), quite literally.
NLS Armenian and all the other legacy encodings will eventually go the way of the woolly mammoth. And the sooner the better.
7 comments.
Technorati tags: armenian, encodings, Language and the Web, translation, unicode, Հայերեն
I18ners and l10ners rejoice, there’s a project brewing over at the Unicode.org site with tons of information that will be helpful to you: the Common Locale Data Repository (CLDR).
It contains the information that is specific to particular places or cultures, so for instance, you’ll find the names of languages, names for currencies, conventions for writing numbers, stuff like that. Here are a few interesting tables to give you an idea of the sort of info that’s buried in the CLDR:
The project is still getting off the ground (public vetting starts this month), but there is already a pretty impressive amount of information there—here’s a big xml file with all kinds of locale info in French, here it is again in Amharic, and so on.
This isn’t the first such effort (geonames.de comes to mind), but it’s the first site I know of where so much data is available in XML. And besides, the fact that the Unicode Consortium is behind it lends the project a lot of weight.
One feature that caught my eye was “exemplarCharacters.” As far as I can tell, this translates roughly to what people think of as an “alphabet” (although it doesn’t define digraphs or ordering). Here’s the set of exemplar characters for French:
[a à â æ b c ç d e é è ê ë f-i î ï j-o ô œ p-u ù û ü v-y ÿ z]
That set can be thought of as the characters that really should be supported by any software that claims to be able to handle French text.
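Since the set reads like a regular-expression character class, you can more or less feed it to a regex engine as is. Here’s a minimal sketch in Ruby; the helper names exemplar_regexp and uncovered_letters are made up for this example, and I’m assuming the spaces in the set are only separators and that only simple ranges like f-i appear.

```ruby
# encoding: utf-8

# The French exemplar set, copied from above.
FRENCH_EXEMPLARS = "[a à â æ b c ç d e é è ê ë f-i î ï j-o ô œ p-u ù û ü v-y ÿ z]"

# Hypothetical helper: turn an exemplar set into a Ruby character class.
# Assumes spaces are separators only, so they are dropped first.
def exemplar_regexp(set)
  Regexp.new(set.delete(" "))
end

# Hypothetical helper: the letters in a text that fall outside the set.
def uncovered_letters(text, set)
  covered = exemplar_regexp(set)
  text.downcase.scan(/\p{L}/).uniq.reject { |letter| letter =~ covered }
end

p uncovered_letters("Ça coûte cher à Noël", FRENCH_EXEMPLARS)
p uncovered_letters("Straße", FRENCH_EXEMPLARS)
```

The first call should come back empty and the second should complain about the ß, which is the sort of quick sanity check you might run on text that claims to be French.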
A couple more examples… Here’s Armenian:
[Ա-Ֆՙ-՟ա-և֊ﬓ-ﬗ]
And here’s Tigrigna (I added the linebreaks):
[ሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍ
ነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕ
ጘ-ፚ፟-፼ᎀ-᎙ⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞ]
Note that these things are written like regular-expression character classes, so [a-z] means “all the letters from a to z in the order they are found in Unicode.”
Even so, I’m pretty sure that the “smallest set award” is shared by Cornish, English, Indonesian, Malaysian, Oromo, Somali, and Swahili, which are defined as having just these characters:
[a-z]
And even that may be overestimating the characters you need for some of those languages—it’s my understanding, for instance, that the Swahili alphabet requires fewer characters than that set includes.
Of course, the hands-down winner would be Rotokas, but I guess they haven’t gotten around to defining that one yet.
When they do, it will look something like this:
[a e i g k o p r s t u v]
That’s all folks!
2 comments.
Technorati tags: cldr, i18n, l10n, Language and the Web, regex, unicode
Fellow language-o-phile Katy Pearce points me to The Translator’s Blues - Will I get replaced by a computer program? over at Slate.
It’s interesting to see a translator’s take on whether machine translation is an economic threat to his livelihood, but I pretty much stand by what I said in a previous post, “Is Machine Translation Possible? Well, yeah, but…”
This bit in particular, though, merits comment:
The one that stood out from the pack was Language Weaver. Not only did it recognize the subject as a human being—”The period of his youth was not easy”—but it translated the rest of the paragraph with only one minor error. Intrigued, I began to put the software through its paces. A headline from El Pais [sic]: “A wave of attacks left more than 100 dead in several cities in Iraq.” So far, so good. A speech from the United Nations: “The problem is to maintain the level of international attention and ensure the implementation of the commitments.” Perfect. The first line of Don Quixote: “In a place of the Channel, whose name do not want to remember, has not much Time living a Hidalgo the spearheaded in shipyard, adarga Antigua, Rocín weak and galgo corridor.” Clearly, in the world of machine translation, everything has its limits.
The problem with translation software is context…
Actually I don’t think context is what’s behind the varying quality of these translations.
The problems with machine translation are:
The reason that the U.N. speech was translated so convincingly is that Language Weaver (and every other MT system out there) was trained on U.N. speeches. If it had been trained on a bazillion carefully translated Cervantes novels, the result would have been equally convincing.
Okay, well, maybe that’s not quite true, since U.N. text is far more formulaic than Cervantes. Generally speaking, the more repetitive and formulaic the training data is, the more accurate the output of MT will be.
But even so, if the training data had had more lanzas and adargas and galgos in the first place, then it would have sent more lances and bucklers and greyhounds out the other end. If the words are in the training data, then the system will do a good job of figuring out how they should be translated.
After all, there are plenty of instances of the phrase “implementation of the commitments” and tons of “international attention” on the U.N.’s site. Not so many “adargas.”
So it’s pretty clear right there that that kind of text has plenty of available training data.
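To see why coverage matters so much, here’s a deliberately silly sketch. It is a plain lookup table, not a real statistical MT system, and the “training data” is a handful of word pairs I invented for illustration. Even so, it shows the basic failure: a word that never appeared in training comes out the other end untranslated.

```ruby
# encoding: utf-8

# Invented, word-aligned Spanish/English pairs standing in for training data.
TRAINING_PAIRS = [
  %w[compromisos commitments],
  %w[compromisos commitments],
  %w[compromisos compromises],
  %w[atención attention],
  %w[internacional international]
]

# Count how often each source word was seen with each target word.
def train(pairs)
  table = Hash.new { |hash, word| hash[word] = Hash.new(0) }
  pairs.each { |source, target| table[source][target] += 1 }
  table
end

# Pick the most frequently seen translation; pass unseen words through as is.
def translate_word(table, word)
  return word unless table.key?(word)
  table[word].max_by { |_target, count| count }.first
end

# Translate word by word, ignoring grammar and word order entirely.
def translate(table, sentence)
  sentence.split.map { |word| translate_word(table, word) }.join(" ")
end

table = train(TRAINING_PAIRS)
puts translate(table, "atención internacional adarga")
```

The U.N. vocabulary comes through fine, while “adarga” falls straight through, much as it did in the Don Quixote sample.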
As far as MT systems are concerned, the rest is mostly (very sophisticated) math.
But such training data simply doesn’t exist between most pairs of languages.
As far as MT systems are concerned, those languages don’t exist.
2 comments.
Technorati tags: machine translation, translation
Dear code monkeys:
In a previous post (with a dumb title, it now occurs to me), I wrote about my obsession with testing everything with non-ASCII data very early in the game: If you are writing an application that deals with multilingual content…
Here’s an ickle case study of that policy, hot off the command line.
I was just reading a neat Ruby/XML tutorial over at XML.com called Creating XML with Ruby and Builder, and there is a bit of code that looks like this:
#!/usr/bin/ruby
require 'rubygems' # had to add this, not sure why
require 'builder'
favorites = {
  'candy' => 'Neccos', 'novel' => 'Empire of the Sun', 'holiday' => 'Easter'
}
xml = Builder::XmlMarkup.new( :target => $stdout, :indent => 2 )
xml.instruct! :xml, :version => "1.1", :encoding => "US-ASCII"
xml.favorites do
  favorites.each do | name, choice |
    xml.favorite( choice, :item => name )
  end
end
So, the first thing I did was shudder in terror at the bit that says :encoding => "US-ASCII", and then promptly copied the sample code from a file called favs.rb to ufavs.rb. I changed that bit to read :encoding => "utf-8", and then I edited the favorites hash to include some test data in Japanese:
favorites = {
  '作家' => '村上春樹', 'キャンディ' => 'ネッコ', '小説' => 'ノルウェイの森', '祝日' => 'お正月'
}
And I ran it, and in this case it seems to work just fine. On my Linux box, anyway, and in my particular shell which is set to UTF-8 by default and…
There are a thousand variables. That’s why it pays to start testing early—as in, immediately—with non-ASCII data.
So typically I will do that when I’m testing little one-off files — I copy a file to u-<whatever> and go off to find some data in a random crazy moon language, edit it up, and see if everything’s still working.
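If you want to make the habit a bit more mechanical, something like this would do. It’s only a sketch: favorites_xml is a hand-rolled stand-in I wrote so the example runs without the Builder gem, and smoke_test is a hypothetical helper, not anything from the tutorial. The idea is just that the very same checks get run against the ASCII data and the non-ASCII data.

```ruby
# encoding: utf-8

ASCII_FAVORITES   = { 'candy' => 'Neccos', 'holiday' => 'Easter' }
UNICODE_FAVORITES = { '作家' => '村上春樹', '祝日' => 'お正月' }

# Hand-rolled stand-in for the tutorial's Builder code (an assumption made so
# this runs with core Ruby only). String#encode does the XML escaping.
def favorites_xml(favorites)
  items = favorites.map do |name, choice|
    "  <favorite item=#{name.encode(xml: :attr)}>#{choice.encode(xml: :text)}</favorite>"
  end
  (['<?xml version="1.0" encoding="UTF-8"?>', '<favorites>'] + items + ['</favorites>']).join("\n")
end

# Hypothetical smoke test: the output must be validly encoded and must still
# contain every key and value that went in.
def smoke_test(favorites)
  xml = favorites_xml(favorites)
  xml.valid_encoding? &&
    favorites.all? { |name, choice| xml.include?(name) && xml.include?(choice) }
end

[ASCII_FAVORITES, UNICODE_FAVORITES].each do |data|
  puts smoke_test(data) ? "ok" : "BROKEN"
end
```

It won’t catch everything (terminals, fonts, and databases all have their own ways of going wrong), but it fails early, which is the whole point.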
Seem to have gotten lucky, this time.
(And I should point out that I’m not trying to rag on the guy who wrote the article. On the contrary, it’s great. I’m just trying to promote this idea of always testing with data that’s not just ASCII.)
No comments yet.
Technorati tags: Code, japanese, Language and the Web, ruby, unicode, xml, 日本語
I believe this is the first internet domain directly related to a language:
Here’s the announcement (translation from Catalan mine, corrections solicited and probably necessary ☺):
Benvinguts al .cat! El passat 16 de setembre ICANN (Internet Corporation For Assigned Names and Numbers) va aprovar definitivament el domini .cat per a la comunitat lingüística i cultural catalana. És a dir, un domini destinat a representar a Internet els usuaris que s’expressen en català o que estan en relació directa amb la promoció de la llengua o la cultura catalanes, fins i tot expressant-se en d’altres llengües. El domini servirà per a ressaltar a nivell global l’existència de la nostra llengua i cultura, del nostre barri d’Internet.
Welcome to .cat! Last September 16th ICANN (Internet Corporation For Assigned Names and Numbers) gave final approval to the .cat domain for the Catalan linguistic and cultural community. That is to say, a domain intended to represent on the internet those users who express themselves in Catalan or who are directly involved in the promotion of Catalan language or culture, even if they express themselves in other languages. The domain will serve to emphasize on a global level the existence of our language and culture, and of our neighborhood of the internet.
There’s a rather funny observation at the Spanish Wikipedia article on the .cat domain (translation mine):
La ICANN ha prohibido expresamente que se utilice el dominio .cat para páginas de gatos (cat en inglés), a no ser que estén en catalán o tenga que ver con la cultura catalana.
ICANN has expressly prohibited the use of the .cat domain for pages about cats (“cat” being English for gato), unless the pages are in Catalan or have something to do with Catalan culture.
The Catalan article also points out that the .cat domain will support internationalized domain names — so URLs in .cat will be able to contain accents.
8 comments.
Technorati tags: Catalan, Language and the Web