Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Major Tom to Ground Control…

Written by Patrick Hall, 2 years, 2 months ago.

Golly, we’ve broken our 3-day limit between posts rule twice over (almost thrice!). As a result, I’m halving the cost of a subscription to this blog.

So there.

In point of fact we’ve been grinding away behind the scenes here at Blogamundo Galactic Headquarters… if all goes well we’re going to be using our tools for the upcoming A Carnival of Blog Translation … just 3 days away. Can the public launch be that much further behind?

Well probably.

Eheh.

Can’t talk now, must go hack some more!

A Carnival of Blog Translation!

Written by Patrick Hall, 2 years, 2 months ago.

Via Chris (dig the new layout!) and in turn via Language Hat, a very cool project which jibes very well with the spirit of our own project: Liz Henry is organizing a “Carnival of Blog Translation”:

On the day of the Carnival, a participant translates one post by another blogger, and posts it on her own blog with a link to the original. She would need to email me, or post in the comments right here, and I’ll compile one big post on the day of the Carnival with links to all the participants.

These are clearly our kind of folks! ☺

There is no restriction on topics; I think the aim is to try to get as much cross-language blog flow as possible. I think I myself will try translating something out of Welsh… I would do Brazilian Portuguese but there is at least one participant who will probably be doing Portuguese already: Bev Trayner herself, who first came up with the idea and described it at the BlogHer blog.

Bev’s blog, Em Duas Línguas (In Two Languages) has interesting observations on bilingual blogging. This one particularly caught my attention:

…I continue to marvel how navigating the online world em duas línguas is more than the sum of navigating it one of two languages.

We couldn’t agree more.

And we’re hoping that Blogamundo (which will be up and running RSN) will help to approximate this experience across more language pairs than anyone could hope to learn. (Even language nuts!)

We’re planning some features designed to shoulder some of the organizational load of efforts like this. I’m really looking forward to seeing how this project progresses. (I’m also dropping a line about this project to the Global Voices Online mailing list, which has tons of multilingual bloggers.)

Wikipedia’s Interwiki Links

Written by Patrick Hall, 2 years, 3 months ago.

interwiki links on Wikipedia

Wikipedia is the biggest multilingual project, ever. “Interwiki links,” links between articles in all the Wikipedias, constitute an impressive translation database. The question is, could it be harvested somehow?

What kind of information is in these links? Well, translations: the Tasmanian Wilderness is called タスマニア原生地域 in Japanese… the Snowdon lily is called Späte Faltenlilie in German… a Moustached Warbler is called a Мустакато шаварче in Bulgarian… and translations for secondary sex characteristics are available in Español (Spanish), Lietuvių (Lithuanian), Svenska (Swedish), and 中文 (Chinese). And on and on.

I’m only just starting to look into this stuff, but the thought of somehow programmatically collecting these correspondences as a rough multilingual lexicon is pretty interesting. Which links are made, between which languages, is all still a pretty random affair, but there are a lot of Wikipedians (myself included) working on fleshing them out. It seems that in the future this will be a sizeable resource indeed. And though there is no metadata in these interwiki links, I suspect that as a whole they will turn out to be more robust than the Wiktionary.
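To make the idea of collecting these correspondences a bit more concrete, here is a minimal sketch in Python. It assumes that interwiki links show up in an article's wikitext in the form [[xx:Title]], with xx a language code; the sample wikitext is made up from the examples in this post rather than fetched from Wikipedia, and the function names are purely illustrative. The pattern will also pick up any other short lowercase prefix, so a real harvester would want to check prefixes against a list of known language codes.

```python
import re
from collections import defaultdict

# Assumption: interwiki links look like [[xx:Title]] in the wikitext,
# where xx is a short lowercase language code (e.g. ja, de, zh-min-nan).
INTERWIKI = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]|]+)\]\]")


def extract_interwiki(wikitext):
    """Return {language_code: title} for every interwiki-style link found."""
    links = {}
    for lang, title in INTERWIKI.findall(wikitext):
        # Titles in links may use underscores in place of spaces.
        links[lang] = title.replace("_", " ").strip()
    return links


def build_lexicon(articles):
    """Map each source article title to its translations by language code."""
    lexicon = defaultdict(dict)
    for title, wikitext in articles.items():
        lexicon[title].update(extract_interwiki(wikitext))
    return dict(lexicon)


# Hypothetical sample data, modelled on the examples in this post.
sample = {
    "Tasmanian Wilderness": "Some article text.\n[[ja:タスマニア原生地域]]\n",
    "Snowdon lily": "More text.\n[[de:Späte Faltenlilie]]\n[[Category:Flora]]\n",
}

lexicon = build_lexicon(sample)
for source_title, translations in lexicon.items():
    print(source_title, translations)
```

Note that the category link is skipped because its prefix does not look like a language code; that is a heuristic, not a guarantee.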

Harvesting and analysis efforts are already being made with Interwiki bots. This graph, for instance, was generated by an automated tool (which apparently lacked a Japanese font!).

A graph of links between Wikipedia articles in several languages

A final prerequisite, and a fundamental one: in order to re-use any content from Wikipedia, one would have to spend some time thinking about how interwiki links themselves fall under Wikipedia’s license, the GNU Free Documentation License. It’s not clear to me how the license applies to the interwiki links alone. The Wikipedia:Forking FAQ says:

As set forth at Wikipedia:Copyrights#Definitions_and_trademarks, Wikipedia considers each Wikipedia article to be an individual document. Moreover, for the purposes of creating derivative works of individual Wikipedia articles, Wikipedia considers a direct link-back to a particular Wikipedia article as being in full compliance with the GNU Free Documentation License (GFDL), provided your derivative work is also licensed under the auspices of the GFDL. As such, would-be Wikipedia forkers need not worry about the challenges involved in setting up a large-scale Web site.

But a link isn’t a document. And should one cite each link with another link? And how could we help to give back to the interwiki linking effort?

Lots of stuff to think about, here. The legalities can get a bit tedious. But they’re certainly important!

Update: Here’s a little comparison of the number of translations in interwiki links with the number of translations on Wiktionary, in this case for the article Fungus:

  1. فطر
  2. Fungi
  3. Гъби
  4. ব্যাঙের_ছাতা
  5. Fong
  6. Houby
  7. Ffwng
  8. Svampe
  9. Pilze
  10. Fungo
  11. Fungi
  12. Seened
  13. Sienet
  14. Mycota
  15. Fungas
  16. פטריות
  17. Gomba
  18. Fungi
  19. 菌類
  20. 균류
  21. Fungi
  22. Pilzeräich
  23. Grybų_karalystė
  24. Габа
  25. Poggenstöhl
  26. Schimmels
  27. Sopper
  28. Grzyby
  29. Fungos
  30. Грибы
  31. Svampar
  32. பூஞ்சைகள்
  33. เห็ดรา
  34. Mantar
  35. Tchampion
  36. 真菌

Wiktionary has seven entries.

Why doesn’t Google index Khmer and Amharic?

Update: They fixed it! All the links that previously didn’t work in this post, do now. Good job Google, better late than never! ☺ (That’s our working motto around here, too…)

Note: you might want to install Khmer and Ethiopic fonts. But you can still get the idea behind this post without having them installed.

Compare:

Three searches for ភាសាខ្មែរ (“Khmer language”) on Yahoo, MSN, and Google. You can click on the images to run the searches yourself.

Successful search for a Khmer word on Yahoo

Successful search for a Khmer word on MSN

Failed search for a Khmer word on Google

Google doesn’t just return zero results, it returns nothingness.

A Google blue screen of death.

And it’s not just Khmer:

Here’s a search for ዩኒኮድ (“Unicode” in Amharic, Tigrigna, and several other Ethiopic languages) on Yahoo, MSN, and Google.

Successful search for an Ethiopic word on Yahoo

Successful search for an Ethiopic word on MSN

Failed search for an Ethiopic word on Google

Once again, Google gives us nothing for those queries. Nary a “Did you mean” or “Your search did not match any documents.”

Just zilch. Zippo. Nada. Niente.

It’s not like nobody at Google has ever heard of these languages: unlike Yahoo and MSN, Google has actually been localized into Amharic, Tigrigna, and Khmer. And they’ve all got millions of speakers.

So what gives?

Theories welcome, I have none.
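For anyone who wants to poke at this, here is a small Python sketch that inspects the query strings used in this post. It only checks that each query is ordinary Unicode text from a single script and shows how it would be percent-encoded as UTF-8 in a URL; it says nothing about what any search engine does with it. The helper names are illustrative, and using the first word of a character's Unicode name as a script label is a rough shortcut, not a proper script lookup.

```python
import unicodedata
from urllib.parse import quote, unquote

# The query strings used in this post.
QUERIES = ["ភាសាខ្មែរ", "ዩኒኮድ", "ဗမာစာ"]


def scripts(text):
    """Rough script labels: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text}


def encoded(text):
    """Percent-encode the text as UTF-8, as it would appear in a query string."""
    return quote(text, safe="")


for query in QUERIES:
    enc = encoded(query)
    # Round-trip check: decoding the encoded form gives back the original.
    assert unquote(enc) == query
    print(query, sorted(scripts(query)), enc)
```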

Update: They don’t bother to index Burmese, either. ဗမာစာ. Yahoo does. MSN, too.

Lame, Google.

Some Unicode News

Written by Patrick Hall, 2 years, 3 months ago.

A couple of news bits from the world of Unicode (both harvested from the Unicode mailing list):

A new site about the intersection of open source and Unicode: Edward Trager’s unifont.org.

This web site provides information about Unicode fonts, Unicode-enabled software, internationalization, and Unicode usability issues on free/libre/open source (FLOSS) operating systems.

This site is awesome, and looks like it will only get better.

He addresses one situation in particular which has caused me no end of frustration: Designing a Better Font Selection Widget. If you’ve ever tried to select a font for a non-Latin script language from an endless list, you’ll be nodding in agreement a couple of paragraphs into that one.

But the star of the show so far is the Unicode Font Guide For Free/Libre Open Source Operating Systems, which includes an absurdly cool shell script which will go out and automatically download a bunch of Unicode fonts just for YOU.

Don’t miss this, even if you’re using OS X or Windows; I think they’re all just TrueType fonts.

The other topic is the beginning of public vetting for the Common Locale Data Repository:

CLDR Version 1.3 contained data for 96 languages and 130 territories. In addition to new data, CLDR 1.4 will add new structure to support, among other things: flexible date & time formatting (including quarters), measurement system names, segmentation (customized line & word break), and transliteration. For more information, see http://www.unicode.org/cldr/.

During this period, the Unicode consortium encourages the submission of proposed new data and proposed corrections into the repository. Most of the data can be entered or viewed via the newly-revised Survey Tool at http://unicode.org/cldr/apps/survey . An “Instructions” link on that page provides usage information.

The survey tool is pretty neat, much more navigable than the raw XML files I pointed to the last time I blogged about CLDR.

There’s nothing like a nice Friday night post about Unicode to warm my heart. ☺

I’ll see your “d”, and raise you an “eth”

Written by Patrick Hall, 2 years, 3 months ago.

Weird.

I was searching for the string {adra sakka} (don’t ask), and look what Google gave back:

[PDF] SIGRÚN ANNA ÓLAFSDÓTTIR
File Format: PDF/Adobe Acrobat - View as HTML
Ef nemendum mistekst er alltaf hægt að fá nýtt blað og gera aðra tilraun. Að lokum eiga nemendur að sikk sakka brúnirnar á blaðinu, allan hringinn.
www.nams.is/textilmennt/pdf/Saumar.pdf - Supplemental Result - Similar pages - Remove result

That is, I asked for d (plain old “d”), and I got ð (U+00F0 LATIN SMALL LETTER ETH).

It just seems like a weird thing to fold together. Or maybe the Icelanders and the Faroese don’t bother so much with their eths?
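Part of why this folding looks odd: ð is not “d plus an accent” as far as Unicode is concerned. It has no decomposition, so the usual trick of normalizing and dropping combining marks leaves it alone, and matching it to “d” takes an explicit extra rule. The Python sketch below illustrates that. The EXTRA_FOLDS table and the function names are hypothetical, written only to show the idea; this is not a claim about how Google actually does it.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters, then drop combining marks (Ú becomes U)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical extra rule: eth has no Unicode decomposition, so folding it
# into "d" has to be spelled out by hand.
EXTRA_FOLDS = str.maketrans({"ð": "d", "Ð": "D"})


def fold(text):
    """Accent stripping plus the hand-written eth rule, lowercased."""
    return strip_accents(text).translate(EXTRA_FOLDS).lower()


print(unicodedata.name("ð"))    # LATIN SMALL LETTER ETH
print(strip_accents("SIGRÚN"))  # the acute accent is dropped
print(strip_accents("aðra"))    # the eth survives accent stripping
print(fold("aðra sakka"))       # only the explicit rule turns ð into d
```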