Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now


Welcome to Nowhere

Written by Patrick Hall, 2 months, 2 weeks ago.

And now, a wee rant.

Check out this page:

CityCarClub.info

Now, tell me where exactly in the world this carsharing business exists. This is the front page of the company!

I was linked to this site by a friend in Helsinki (in the context of another conversation, about carsharing, as it happens). But the only place Helsinki is mentioned on this company’s website is in a caption to a photo. We are also met with this quote:

City Car Club offers a cost effective and easy to use way of driving. We have different types of cars and vans available at over 70 locations in the capital area.

The capital area?

So this is a Finnish carsharing company, based in Helsinki (I think), with a website in English, and nary a mention of Helsinki.

Or Finland.

Heck, or even Finnish… in order to get to the Finnish version, I just randomly guessed and went to CityCarClub.fi.

And guess what? No mention of Helsinki there either! (Or Helsingissä or any of the other 148614601 forms a Finnish noun can take…)

I have noticed that the same thing is often true of local newspapers’ websites. They’ll be called “The Gazette” or “The Tribune” or “The Local Paper.”

But they don’t tell you where they are!

Does this strike anyone else around here as weird?

Sheesh, talk about localization issues…

Machine translation and Open Source

Written by Patrick Hall, 2 months, 2 weeks ago.

Information Week blogger Serdar Yegulalp has some thoughts on the intersection of machine translation and open source:
Talk To Me, Openly - Open Source Blog - InformationWeek

He’s got an interesting anecdote about how he tackled studying Japanese, and it serves as a nice intro to the idea behind bitext and statistical machine translation:

..Since I didn’t have money for classes, I homebrewed my own self-teaching method. I went out and bought a grammar guide, and then two copies of a given book — one in Japanese, the other an English translation — and sat with them side-by-side, comparing the two on a sentence-by-sentence and phrase-by-phrase level. It worked, up to a point, and while I’m no native speaker I can certainly figure out a fair amount of what’s put in front of me as long as I have a dictionary.

I didn’t know it at the time, but this parallel-texts technique is actually one of the best ways to also teach a computer to perform translations between languages.

He’s also got some thoughts on the licensing issues involved with the data used to build MT systems, a topic I don’t think has gotten enough attention.

(Please consider this an open thread for your thoughts on how MT and FOSS can and should interact.)

Don’t sort stuff in Unicode with Bash?

Written by Patrick Hall, 2 months, 4 weeks ago.

Update: Okay, duh: I shouldn’t have called it “Bash”. What I meant was, “whatever the sort utility is in my default terminal.” Which, as Bryan points out in a comment below, has nothing to do with Bash: it’s GNU Sort. More updates below.

I have a little text file with “Hello World” in lots of languages, which I often use for testing. I extracted a few lines with various scripts and saved that as helloworld.txt.

$ cat helloworld.txt
สวัสดีราคาถูก!  Thai
Habari dunia!   Kiswahili
Halló heimur!   Icelandic
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
Привет, мир!    Russian
ሠላም ዓለም!        Amharic
안녕, 세상!     Korean
Chào thế giới!  Vietnamese
Hallo, wrâld    Frisian
Hallo verden!   Norwegian/Bokmal
Laba ryta, pasauli!     Lithuanian

For my first amazing trick, I sort the file with the Bash shell built-in:

$ sort helloworld.txt
ሠላም ዓለም!        Amharic
Chào thế giới!  Vietnamese
Habari dunia!   Kiswahili
Halló heimur!   Icelandic
Hallo verden!   Norwegian/Bokmal
Hallo, wrâld    Frisian
안녕, 세상!     Korean
Laba ryta, pasauli!     Lithuanian
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
สวัสดีราคาถูก!  Thai
Привет, мир!    Russian

…which sucks. Because obviously Bash is ignoring anything fancy (Amharic, Korean, Thai) and sorting strictly by whatever ASCII shows up in the line. (Hard to say whether the «ó» in Icelandic is being considered, but shouldn’t it come after «o» anyway?)

I also installed and tried another terminal called rxvt-unicode, which supposedly has better Unicode support. I got the same results as in Bash under gnome-terminal, which suggests to me that the problem is Bash, or somewhere deeper, and not the terminal itself.

Next I tried sorting the same file in Python:

$ python
>>> lines = open('helloworld.txt').read().decode('utf-8').splitlines()
>>> for line in sorted(lines): print line
...
Chào thế giới!  Vietnamese
Habari dunia!   Kiswahili
Hallo verden!   Norwegian/Bokmal
Hallo, wrâld    Frisian
Halló heimur!   Icelandic
Laba ryta, pasauli!     Lithuanian
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
Привет, мир!    Russian
สวัสดีราคาถูก!  Thai
ሠላም ዓለም!        Amharic
안녕, 세상!     Korean

Python does better; clearly things are being sorted according to their Unicode code points. Which of course is a far cry from following UTS #10: Unicode Collation Algorithm, but that has to do with locales and all that.
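For anyone trying this today, here is a rough Python 3 sketch of the same idea. It is not the session above (that was Python 2); the list is a handful of lines from helloworld.txt typed in by hand, and the names greetings and by_code_point are made up for illustration. In Python 3, comparing two str values compares their code points, so a plain sorted() is enough:

```python
# Python 3 sketch: str comparison goes code point by code point,
# so sorted() orders these lines by Unicode code point.
greetings = [
    "Halló heimur!",
    "Привет, мир!",
    "Hallo verden!",
    "안녕, 세상!",
    "Habari dunia!",
    "ሠላም ዓለም!",
]

by_code_point = sorted(greetings)

for line in by_code_point:
    # Print the code point of the first character next to each line.
    print(f"U+{ord(line[0]):04X}  {line}")
```

The Latin lines come first, then Cyrillic, then Ethiopic, then Hangul, simply because that is where those blocks sit in Unicode. For ordering that follows a language's own rules, the standard library's locale.strxfrm can be used as a sort key, though what it gives you depends on which locales are installed on the machine.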

In any case, I won’t be trusting Bash to sort Unicode files any more.

(I’d be interested to know what the default sort does to the initial input in various other programming languages, comments welcome.)

Update:

After Bryan’s comment pointed out that I wasn’t even dealing with Bash but with GNU sort, I read through the manual and discovered the following trick in a footnote:

$ export LC_ALL=C; sort helloworld.txt
Chào thế giới!  Vietnamese
Habari dunia!   Kiswahili
Hallo verden!   Norwegian/Bokmal
Hallo, wrâld    Frisian
Halló heimur!   Icelandic
Laba ryta, pasauli!     Lithuanian
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
Привет, мир!    Russian
สวัสดีราคาถูก!  Thai
ሠላም ዓለም!        Amharic
안녕, 세상!     Korean

Which seems to be what I was looking for.
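Why does this match the Python output line for line? My understanding is that in the C locale sort compares raw byte values, and UTF-8 is designed so that byte order and code point order agree. A small Python 3 sketch to check that claim, assuming the file is UTF-8 (the variable names are mine, and the list is just a few lines from the file):

```python
# Python 3 sketch: sorting by UTF-8 bytes gives the same order as
# sorting by code point, which is why LC_ALL=C matches Python here.
lines = [
    "안녕, 세상!",
    "Halló heimur!",
    "สวัสดีราคาถูก!",
    "Hallo, wrâld",
    "Привет, мир!",
]

by_code_point = sorted(lines)
by_utf8_bytes = sorted(lines, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_utf8_bytes)  # prints True
```

So LC_ALL=C is not doing anything clever with Unicode; it is just ignoring the locale's collation rules and comparing bytes.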

“Translation is hugely useful!”

Written by Patrick Hall, 3 months, 1 week ago.

Here’s a post from Salon.com’s interesting How the World Works Globalization blog about an amazing translator:

Salon.com Technology | Fragments of the Tocharian

Translation is an under-appreciated art, but Ji Xianlin risked his life for his craft:

…he secretly translated the entire Indian epic, “The Ramayana,” from the original Sanskrit into Chinese, while experiencing the travails that afflicted nearly all Chinese intellectuals during the Cultural Revolution.

Ji’s observation about the utility of translation is trenchant:

It is translation that has preserved the perpetual youth of Chinese civilization. Translation is hugely useful!

We agree.