Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

A Nice Language Switching Widget

Written by Patrick Hall, 2 years, 6 months ago.

I’ve been looking out for ways to format multilingual content in blog entries—we intend to eventually start having translated content in the present blog as a test case.

Simple interfaces are the best, and I ran across a simple, intuitive implementation of “language toggling,” if you will:

seweso’s blog: Google base

You can see from our astonishing animated GIF technology that the content changes right there in place: click on the language you want to see, there it is, right there.

It hardly even merits the title of “widget.” But the best interfaces are like that—they’re so simple you might not even notice them, or even if you do, you don’t stop to think that there could be another way.

I like this because the text remains in place, and you don’t have to wait for the whole page to reload, since it’s just toggling the visibility of two divs with some Javascript.

A couple of things I’d change, though:

  • Somehow highlight the name of the “current” language: put the language names in tabs or something?
  • It should be possible not just to toggle between languages in a single post, but also to save preferences for the blog as a whole.

Some techier notes…

That said, the Javascript itself is inline: it’s coded right into the onclick attribute of the links. I poked about in the navigation of Wouter’s site, and this language toggling functionality doesn’t seem to be applied to other posts on seweso (they seem to be either in Dutch or English, but never both as in this post). So this was probably a quick solution—and it works very well as such.

For a blog that has regular multilingual content, then (and possibly some views in Blogamundo’s aggregator), this kind of functionality would be available on every post.

Which would probably involve one final change:

  • Factor out the Javascript into external files that can be used for more than one post.

This leaves the question of how to mark up all this stuff. I believe it’s possible to select divs with Javascript by referring to the lang attribute, so maybe it’d be something like… er, this is a whole ‘nother ball of wax.

I’ll get back to you, we’ll do lunch.

How We Can Help: Persian

Written by Patrick Hall, 2 years, 6 months ago.

I’m starting a new category (Applications) to point out when other bloggers mention the need for translation. First up, Global Voices Online contributor Hossein Derakhshan (better known as Hoder):

E:M | Khamenei and Rafsanjani: The split is now public

There are lots of things like this happening in Iran these days which remain completely out of the world’s radars, especially to those who can’t read Persian.

Read what I wrote about this, as well as people’s comments, on my Persian blog.

When Blogamundo gets going, we want to help the Persian blogosphere (and many others) to organize translation efforts to bring their experience to a wider audience.

(And I hasten to add that this doesn’t mean only translating into English.)

Site Update: Still Tuning Spam Blocking…

Written by Patrick Hall, 2 years, 6 months ago.

We’ve installed a small battalion of anti-spam plugins on this site, and we’re still training them a bit. So, if you leave a comment and it says your comment has been classified as spam, sorry. It will be added quickly!

Comments are most certainly welcome!

I Love Unicode SO MUCH…

Written by Patrick Hall, 2 years, 6 months ago.

…that I once emailed Ray Kurzweil, bewailing the fact that he had a website for a book about the future encoded as (*gasp*) ISO-8859-1.

Haven’t heard back yet. ☺

Sharing exceptions between Python and Ruby

Written by Jonas Galvez, 2 years, 6 months ago.

An essential piece of software behind Blogamundo is a full-fledged multi-user RSS (and Atom) aggregator. We’re building the web UI with Rails, so we were naturally looking forward to writing our feed poller in Ruby. As it turns out, though, there aren’t really many RSS parsing libraries for Ruby, and the best ones would still require quite a bit of fiddling in order to be suitable for our purposes. After playing with a bunch of libraries I was quickly convinced of something that in reality I already knew: there isn’t anything better than (or as hackable as) Mark Pilgrim’s Universal Feed Parser.

At first thought, having our web UI written in Rails and our feed poller in Python involves a trivial interoperability scheme: the poller loads parsed feed data into the database which the web UI can then query afterwards. The problem arises when we need to parse feeds directly from within the web UI and provide the user with immediate data. We needed a reliable way to make Rails communicate with the Python feed poller.

After much pondering about inter-process communication mechanisms (and how they scale), I figured the simplest and perhaps most scalable way to do this would be to use HTTP: make our poller run in a local HTTP server which Rails can query. The data format I chose, of course, was YAML. Currently our poller is running in Apache with mod_python, but it can be scaled out further via FCGI or SCGI.

So, here’s a code demonstration of what I just described:

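# python code
import sys

# Emit the HTTP header, a blank line, then the YAML body.
sys.stdout.write('Content-Type: text/yaml\r\n\r\n')
sys.stdout.write('{yaml_test: Hello, some_property: 2}')

# ruby code
require 'yaml'
require 'open-uri'  # lets us read an http:// URL like a file

# URI.open needs Ruby 2.5 or later; older versions use plain open(...)
res = YAML.load(URI.open('http://127.0.0.1/test.py').read)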

Staggeringly simple (and stupid, some might say). This could result in a slew of different errors, so we also needed a reliable error handling scheme. It quickly made sense to have our error codes defined in an external file, and inspired by this post from Sam Ruby, I’ve written code to parse the file and dynamically generate exception classes at runtime, both in Python and Ruby. Here’s the relevant code:

# config file
MyExceptionX:
    code: 1
    message: This is MyExceptionX
MyExceptionY:
    code: 2
    message: This is MyExceptionY
MyExceptionZ:
    code: 3
    message: This is MyExceptionZ

# python code (Python 2; the new module was removed in Python 3)
import new
import syck # YAML parser for Python

ERROR_CODES = syck.load(open('config/errorcodes.yml').read())

class MyException(Exception):
    pass

for k, v in ERROR_CODES.items():
    nexc = new.classobj(k, (MyException,), {})
    nexc.code, nexc.message = v['code'], v['message']
    tostr = lambda self: "{code: %s, message: %s}" \
             % (self.__class__.code, self.__class__.message)
    setattr(nexc, "__str__", new.instancemethod(tostr, None, nexc))
    globals()[k] = nexc

# ruby code
require 'yaml'

ERROR_CODES = YAML.load(File.read("config/errorcodes.yml"))

module MyExceptions
    class MyException < StandardError
    end
    CODES = []
    # each (rather than for) gives every iteration its own k and v,
    # so each to_s closure reports its own code and message
    ERROR_CODES.each do |k, v|
        CODES[v['code']] = const_set(k, Class.new(MyException) {
            define_method :to_s do
                "{code: #{v['code']}, message: #{v['message']}}"
            end
        })
    end
end
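
For anyone on Python 3, where the new module no longer exists, here is a rough sketch of the same idea using the built-in type(). It is only an illustration, not the code we run: the ERROR_CODES dict stands in for the parsed config/errorcodes.yml, and the build_exceptions helper and the CODES lookup (borrowed from the Ruby version) are names made up for this example.

```python
# Stand-in for the parsed config/errorcodes.yml (assumption: same shape as the
# config file shown above, already loaded by whatever YAML parser you use).
ERROR_CODES = {
    "MyExceptionX": {"code": 1, "message": "This is MyExceptionX"},
    "MyExceptionY": {"code": 2, "message": "This is MyExceptionY"},
    "MyExceptionZ": {"code": 3, "message": "This is MyExceptionZ"},
}


class MyException(Exception):
    """Common base class, so callers can catch every generated error at once."""

    code = None
    message = None

    def __str__(self):
        # Values are read from the concrete subclass when the string is built.
        return "{code: %s, message: %s}" % (self.code, self.message)


def build_exceptions(error_codes):
    """Hypothetical helper: return a {code: exception class} lookup."""
    by_code = {}
    for name, spec in error_codes.items():
        # type(name, bases, namespace) creates a new class at runtime.
        namespace = {"code": spec["code"], "message": spec["message"]}
        by_code[spec["code"]] = type(name, (MyException,), namespace)
    return by_code


CODES = build_exceptions(ERROR_CODES)
```

With that in place, a client that receives an error code from the poller could do something like raise CODES[code]() and catch MyException on the way out.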

I hope this proves useful to someone. I’m really not certain what the best approach to this problem is, but this one proved to be very reliable during testing. Suggestions are of course very welcome.

A Short Story about Machine Translation

Written by Patrick Hall, 2 years, 6 months ago.

Once upon a time, there were a bunch of universities who were drafted to participate in a “Surprise Language Machine Translation exercise.”

The exercise went like this:

“We’re going to tell you the name of a language, and all you machine translation departments will have one month to get something out of that language that more or less resembles English.”

The mystery language turned out to be Hindi, and the boffins got in gear.

Now, when you have a massive nationwide brain trust including people like Franz Josef Och, Philip Resnik, and Dan Melamed collaborating and competing to build a machine translation system, you will get results—the best in the business.

(Mostly notwithstanding the muddlings at the not-so brain-trusty rungs of the ladder, where yours truly munged away as a Perl apprentice.)

Of course, the best in the machine translation business still isn’t exactly limpid prose… but then it’s not entirely useless, either (and it’s getting better).

Point being, there was plenty of output at the end of the month.

They’re so Bleu

Now, here’s something you might not know about MT: getting the system to produce output is only the beginning of the work. Then comes the painstaking stage called “evaluation,” where you have to compare the outputs, and decide which is best.

This is usually done by comparing the translations to a translation by a real live human—this is called the “gold standard.” In fact, the most common metric for evaluating MT systems is surprisingly simple: you essentially count which MT output has the most strings of words in common with gold standard translations. Whichever has the most, wins. (You can read about that metric here if you’re curious, it’s called “Bleu.”)
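
To make the "strings of words in common" idea a bit more concrete, here is a toy sketch of the counting step. It is a simplification and not the full Bleu metric, which combines several n-gram lengths and also penalizes outputs that are too short. The function names and the example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clip the credit for each n-gram to the number of times the reference
    # uses it, so repeating one good word over and over does not pay off.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


gold = "the cat sat on the mat"
system_a = "the cat sat on a mat"
system_b = "cat the mat on sat the"

for name, output in (("A", system_a), ("B", system_b)):
    print(name, ngram_precision(output, gold, 1), ngram_precision(output, gold, 2))
```

System B gets every single word right but in a scrambled order, so it does well on one-word matches and badly on two-word matches. That is why longer strings of words matter for this kind of scoring.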

But anon!

There were no gold standard translations for Hindi»English. That was the whole point! If there had already been good, carefully translated texts, well, the surprise language wouldn’t have been terribly surprising.

After all, these guys already had systems for doing translations between famous language pairs like French»English or English»Spanish. (You know what I mean: the dialects with armies and navies.) They could have just fed the systems Hindi»English content, and whammo, MT system. But there just wasn’t enough such translated content to be had.

Faced with this dilemma, the boffins did what they do best: they came up with a clever hack. They simply put up a website with some texts in English, and rewarded amateur translators on the internet with gift certificates to Amazon.com if they would translate those texts into Hindi.

And here’s where the story gets interesting.

Forests and Trees

Lo and behold, the response was overwhelming. And within just a few days they had stacks of gold standard translations—far more human translations than they needed to function as a gold standard, in fact.

So the boffins thought that all was well and they went back to tinkering with the real problem: their MT systems… And they happily labored, knowing that they would be in possession of a scientifically sound point of comparison at the end of the month… lots of good Hindi»English translations, with which to compare their not-so-hot machine translations.

Now, stop and think about this a minute, O Children of the Age of Wikipedia.

Can you feel the irony?

Update: Speak of the devil — if you’re curious to read more about how statistical machine translation works (and how it relies on the existence of translated content), check out Translation by Numbers at Technology Review.

Using Search Engines to Find Blogs by Language

Written by Patrick Hall, 2 years, 6 months ago.

One key part of Blogamundo is going to be an aggregator. And of course, it will be multilingual. And when you put “multilingual” and “web” into the same sentence, sadly enough, you’re going to have to deal with encodings, too. Or rather, you’re going to have to end up converting everything which isn’t, er, pure, to The One True Encoding.

Which will involve testing, which involves finding a bunch of blogs in some particular language in whatever encodings that language happens to be written in.
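
Here is a rough sketch of what that conversion step might look like. It rests on a few assumptions: that The One True Encoding means UTF-8, that the feed or page may or may not declare its encoding, and that windows-1254 and ISO-8859-9 are reasonable fallbacks for Turkish. The to_utf8 function is a made-up name for this example.

```python
def to_utf8(raw, declared_encoding=None,
            fallbacks=("utf-8", "windows-1254", "iso-8859-9")):
    """Decode raw bytes with the first encoding that works, re-encode as UTF-8.

    Returns (utf8_bytes, encoding_used). UTF-8 is tried before the legacy
    encodings because bytes that happen to be valid UTF-8 are unlikely to be
    anything else; that ordering is a heuristic, not a guarantee.
    """
    candidates = ([declared_encoding] if declared_encoding else []) + list(fallbacks)
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or an encoding name Python does not know.
            continue
        return text.encode("utf-8"), encoding
    raise ValueError("could not decode input with any candidate encoding")


original = "İstanbul'da hava güneşli"
raw = original.encode("iso-8859-9")
utf8_bytes, used = to_utf8(raw, declared_encoding="iso-8859-9")
print(used, utf8_bytes.decode("utf-8"))
```

Guessing like this can silently pick the wrong legacy encoding, which is exactly why we need a pile of real blogs to test against.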

Now, all of these noble goals aside, the fact is I think looking for blogs in languages I don’t know is… well, kind of fun really.

So I thought I’d write down a few approaches we’ve used, starting with the blatantly obvious.

Comments are of course welcome…

Let’s take, oh, I dunno, Turkish as an example. Obviously the first stop is search engines.

Blog search engines

This idea is painfully simple: just pick a word that is likely to show up in the language in question. City names work well.

→ Technorati

They’ve got language identification stuff built in, but let’s just start with a likely keyword:

Technorati search ‘ankara’

Searching for “ankara”—sure enough, we get some hits there: one, two, three
(These three all on MSN, interestingly enough.)

Restricting to Turkish works as expected:

Technorati search in Turkish ‘ankara’

→ Google Blog Search

Here’s the equivalent search on Google’s blog search thingie, also restricted to Turkish:

Google Blogsearch in Turkish, ‘ankara’

More Turkish results: one, two, three

→ Icerocket

Icerocket ‘ankara’

Works as well, but there is no language restriction option on Icerocket.com, apparently.

Well that was simple.

Turkish was easy; there seems to be plenty of Turkish blog content out there, and two out of three search engines have an option for restricting searches to Turkish. (Wikipedia tells us that there are something like 60 – 75 million Turkish speakers—it’s a pretty big language.)

So this game wasn’t too challenging. Between those three search engines we would probably be able to come up with maybe 100 URLs of Turkish blogs, just by digging around in blogrolls and/or writing a script to spider them (or using an existing script like Sean B. Palmer’s).
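
The blogroll-digging part could be sketched roughly like this, using only the Python standard library. This is a minimal illustration rather than a real spider: it handles one page that has already been fetched, and the LinkCollector and external_links names (and the example.com-style URLs) are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def external_links(html, base_url):
    """Return links that point away from the blog itself, without duplicates."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    own_host = urlparse(base_url).netloc
    found = []
    for link in collector.links:
        parsed = urlparse(link)
        is_web = parsed.scheme in ("http", "https")
        if is_web and parsed.netloc != own_host and link not in found:
            found.append(link)
    return found


page = (
    '<ul>'
    '<li><a href="http://a.example/">A</a></li>'
    '<li><a href="/about">About me</a></li>'
    '<li><a href="http://b.example/blog">B</a></li>'
    '<li><a href="http://a.example/">A again</a></li>'
    '</ul>'
)
print(external_links(page, "http://myblog.example/"))
```

Not every outbound link is a blog, of course, so a list built this way would still need to be checked by hand or filtered by language.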

Next time we’ll look at building such a list, and then try to get a bit of info about Turkish word frequency, which we can then feed back into our searches. After all, not everybody’s blogging about Ankara.
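
As a preview, the word frequency part might be sketched like this. It assumes we already have the text of some posts as Python strings; the turkish_lower and top_words names, the sample sentences, and the minimum word length are all just choices made for the example.

```python
import re
from collections import Counter


def turkish_lower(text):
    """Lowercase with Turkish casing: dotted İ becomes i, plain I becomes dotless ı."""
    # Python's str.lower() applies language-neutral rules, so the two
    # Turkish-specific letters are mapped by hand before calling it.
    return text.replace("İ", "i").replace("I", "ı").lower()


def top_words(texts, n=10, min_length=4):
    """Return the n most frequent words of at least min_length letters."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
        words = re.findall(r"[^\W\d_]+", turkish_lower(text))
        counts.update(word for word in words if len(word) >= min_length)
    return [word for word, _ in counts.most_common(n)]


posts = [
    "Bugün hava çok güzel, bugün dışarı çıkacağım.",
    "Bugün İstanbul güzel.",
]
print(top_words(posts, n=2))
```

The most frequent words from a sample of known Turkish posts would then become the next round of search keywords.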

But is it really?

But we’ll also take a closer look at the results we’ve gotten: are they really all in Turkish? I suspect that we’ll find some edge cases sooner or later.

Global Voices chat: log and recap

Written by Patrick Hall, 2 years, 7 months ago.

You can read the log of our chat from the #globalvoices channel here: Global Voices / Blogamundo IRC Meeting Log.

It’s a bit long; if you’re in a hurry you might want to come back later tonight as I’ll be updating this post with a summary of key points.

Thanks, Rebecca, for setting up the chat!

(And as long as we’re on the subject of chat, we’ve set up our own channel on freenode at #blogamundo, feel free to drop by.)

Summary

So after having reread the logs of today’s irc discussion, I realize that I got a bit confused what with all the ideas flying around. I think I’ve got all the main concepts boiled down to just two goals:

  1. Outbound translation: How to help their multilingual readers to translate Global Voices content into other languages
  2. Inbound translation: How to organize efforts to translate content in many languages for the benefit of all Global Voices readers; for the time being, this would mean organizing translations into English

Rebecca has a workable plan for a near-term solution to #1: it involves using special tags on del.icio.us to label outbound translations, and then aggregating those translations at Bloglines. We agreed that this is a good starting point, and in the coming week or so we’re going to see if we can come up with some ideas for streamlining that process.

I’ll have more to say about this tomorrow. I’ll also be looking for suggestions and further comments… more soon!

It was a lot of fun meeting (and re-meeting) everyone from GVO, can’t wait to see what we can come up with.

(This is the part where I go to sleep. ☺ )

How can Blogamundo help Global Voices?

Written by Patrick Hall, 2 years, 7 months ago.

These are some introductory notes for the October 17th IRC meeting with the folks from Global Voices—Thanks to Rebecca for organizing the discussion!

Bloga- what?

Blogamundo! “Mundo” is the Spanish word for “world.”

(And Portuguese too, for that matter. Ehem.)

Blogamundo.com is a new project to try to increase the amount of translation among blogospheres. (And, eventually, other -ospheres.)

Global Voices has already done some great work in creating initial links across linguistic divides. The glimpses of other cultures are always fascinating. But I’ve often had the experience of reading a Global Roundup on GV and seeing a tantalizing description of a post, only to see a link to a post in a language that I can’t read.

Doh!

Don’t get me wrong; there is great value in these peeks into other blogospheres, but in some cases the post on the other end of those links is well-nigh crying out to be translated in full.

Why isn’t there more translation happening?

An obvious question is: why isn’t more translation already happening on the web? Or if it is, why isn’t it easier to find?

After thinking about the problem for a while, here’s what I’ve come up with:

  1. Translation is hard work. Joi Ito wrote about that fact a while back. (I’ll disagree with his claim that translation isn’t fun, though… well, it can be fun. More on this later.)
  2. “CAT” translation tools haven’t hit the web. “Computer Aided Translation” (CAT) tools exist, but they’re expensive and built on top of proprietary software (with at least one notable exception), so they’re used almost exclusively by professional translators.
  3. Translators don’t get to put their personalities on the web. There are plenty of translators who translate off and on, but to date there’s really no place where translators can show their work online. There should be.
  4. New translators don’t know where to start. And while even a simple CAT tool could help an amateur translator to improve their proficiency, few even know such a thing exists.

Have you ever heard of a CAT tool?

Let’s make it easier

Blogamundo is aimed at addressing these problems. We’re building a simple browser-based CAT tool and a network around it. The network will handle:

  • Aggregating content by language or tag
  • Republishing translated content
  • A mechanism for requesting translation of specific content

And in the longer term:

  • Lexicon collaboration, exchange, export (under a Creative Commons license)
  • A skill-measuring system for helping professional translators show they’ve got the chops
  • A translation market for translators to promote and market their skills

Let’s work together

I think it makes sense for us to talk about a simple task:

How can Blogamundo help Global Voices to organize translation around the “dead end” links in the roundups?