Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Unicode headed toward World Domination™

Written by Patrick Hall, 6 days, 5 hours ago.

The Google Blog has a chart showing that there is a very clear trend toward Unicode adoption.

Apparently their numbers refer to UTF-8 alone (as opposed to UTF-16/UCS-2 or, haha, UTF-32/UCS-4), which again is good news. (Though one wonders if there is any uptake of UTF-16 on the web… I hope not.)

The data is “Google internal”… peer-reviewed, it ain’t.

Thanks to Won for the pointer!

Of the Media, Scientists, Word Lengths, and Colossal Squids

Written by Patrick Hall, 1 week, 2 days ago.

I’m a big fan of Pharyngula, but… he was kinna wrong about a nitpicky little detail. And this particular nitpicky little detail was about language, so, the truth must out!

In his post As big as dinner plates?, Dr. Myers compares two articles about the recent dissection of a Colossal squid. (Can we please pause to acknowledge that that thing is HUGE? Kthx.)

Read the USA Today article on the colossal squid eye, which boils down to basically, “Oooh, they’re big!”. Then compare it to the blog entry on the colossal squid eye, written by a scientist. The latter is much more informative, and contains more specific details, and isn’t afraid to challenge the reader with words longer than a single syllable.

Emphasis added, to the bit that’s linguistic. Now obviously, that’s an exaggeration; the AP article (not USA Today, in fact) doesn’t really consist of one-syllable words (though one can say a lot with one-syllable words…)

But the idea is clear enough: USA Today uses shorter words than scientists, because journalists dumb down science.

Right?

Heck, I dunno. So, I answered the question the way I usually do: I wrote a program.

Surprise!

Average word lengths:
4.45 Blog
4.64 AP

Longest words, Blog:
photoreceptors
tremendously
considerably
architeuthis
neighbouring
cephalopods
cephalopods
disappeared

Longest words, AP:
mesonychoteuthis
communications
international
invertebrates
redistributed
formaldehyde
centimeters
spectacular

Average word length is about the same. Which says nothing whatsoever about the quality of information in the articles; it does however say that they use words that are about the same size.

One can argue about the character of the long words in each text (you could use frequency counts of them, too), but still, the AP article uses the word “Mesonychoteuthis.” That’s a long way from a one-syllable word.

Here’s the code.
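
For anyone who wants to try this at home, here is a minimal sketch of that kind of measurement in Python. It is not the program linked above: the function name `word_stats`, the tokenizing regex, and the sample sentence are all assumptions made for illustration. Words are not de-duplicated, so a word that occurs twice can show up twice in the "longest" list.

```python
import re

def word_stats(text, top=8):
    """Return (average word length, the `top` longest words) for a text."""
    # Treat any run of letters as a word; digits and punctuation are skipped.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0, []
    average = sum(len(w) for w in words) / len(words)
    # sorted() is stable, so equally long words keep their original order.
    longest = sorted(words, key=len, reverse=True)[:top]
    return average, longest

# Hypothetical sample sentence, not taken from either article.
sample = "The colossal squid has tremendously large eyes"
average, longest = word_stats(sample, top=3)
print("%.2f" % average)
print(longest)
```

Run each article's text through `word_stats` and compare the two averages.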

Worth a look: An Introduction to Opentype

Written by Patrick Hall, 2 weeks, 1 day ago.

I’m a Unicode geek, but I feel like I don’t really know enough about what goes on in operating systems after the encoding and decoding issues are worked out. That is, when and where does all of that glyph selection and font shaping and other black magic actually happen? How much of it is dependent on the operating system? Applications? Fonts?

Head a splodes.

But whatever, live and learn. I did run across a very nicely done introduction to what appears to be the cutting edge in computer typography, OpenType:

Adam Twardoch’s PDF slides on OpenType: Typographic perfection with OpenType?

It gives a good idea of the sort of subtleties that OpenType can handle. And it’s very pretty. Font nerds, rejoice…

Upper West Side, Zona Sul, and other tricky subdivisions

Written by Patrick Hall, 2 weeks, 3 days ago.

Random translation observation:

I was translating some Brazilian Portuguese into English, and the source article was about an earthquake near São Paulo. (See? Brazil does have natural disasters!)

A particular phrase got me thinking: a zona leste de São Paulo, meaning something like “the Eastern Zone of São Paulo.” That’s a pretty tricky thing to translate: you don’t really talk about the “Eastern Zone” of a city in English.

In the States, you might talk about the “Upper East Side” of New York, or the “South Side” of Chicago, or “Northwest” (sometimes just “NW”) in DC. My Brazilo-Londonian homey Carlos tells me that zones in London have numbers, so you talk about “Zone 5,” etc.

And then there are those arrondissements in Paris, which are numbered like an escargot.

So does “The East Side of São Paulo” work as a translation for “a zona leste de São Paulo”? Sounds okay to me, actually.

I’d be curious to know about the ways that other cities are subdivided.

Unicode support in Ruby 1.9! Yippee!

Written by Patrick Hall, 1 month, 1 week ago.

$ cat unicode.rb
# -*- coding: utf-8 -*-

s = "ABCあいう"
puts "s: #{s}"
puts "s[0]: #{s[0]}"
puts "s[3,1]: #{s[3,1]}"

puts "s.length: #{s.length}"
puts "s.reverse: #{s.reverse}"
puts "s.encoding: #{s.encoding}"

$ ruby1.8 unicode.rb
s: ABCあいう
s[0]: 65
s[3,1]:
s.length: 12
s.reverse: ��㄁め�CBA
unicode.rb:10: undefined method `encoding' for "ABC\343\201\202\343\201\204\343\201\206":String (NoMethodError)

$ ruby1.9 unicode.rb
s: ABCあいう
s[0]: A
s[3,1]: あ
s.length: 6
s.reverse: ういあCBA
s.encoding: UTF-8
$ # yay!

Big ups to Matz.
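
The difference between the two runs comes down to bytes versus characters: the 1.8 output is what you get when a UTF-8 string is treated as a sequence of bytes. Here is a rough Python 3 sketch of the same contrast. It is only an analogy, not a description of how either Ruby version is implemented.

```python
s = "ABCあいう"
raw = s.encode("utf-8")  # the bytes actually stored in the file

# Byte view: each kana takes 3 bytes, so 3 + 3 * 3 = 12.
print(len(raw))   # 12
print(raw[0])     # 65, the byte value of "A"

# Character view: six code points.
print(len(s))     # 6
print(s[3])       # あ
print(s[::-1])    # ういあCBA

# Reversing the bytes scrambles the multi-byte sequences into invalid UTF-8.
print(raw[::-1].decode("utf-8", errors="replace"))
```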

Welcome to Nowhere

Written by Patrick Hall, 2 months, 1 week ago.

And now, a wee rant.

Check out this page:

CityCarClub.info

Now, tell me where exactly in the world this carsharing business exists. This is the front page of the company!

I was linked to this site by a friend in Helsinki (in the context of another conversation, about carsharing, as it happens). But the only place Helsinki is mentioned on this company’s website is in a caption to a photo. We are also met with this quote:

City Car Club offers a cost effective and easy to use way of driving. We have different types of cars and vans available at over 70 locations in the capital area.

The capital area?

So this is a Finnish carsharing company, based in Helsinki (I think), with a website in English, and nary a mention of Helsinki.

Or Finland.

Heck, or even Finnish… in order to get to the Finnish version, I just randomly guessed and went to CityCarClub.fi.

And guess what? No mention of Helsinki there either! (Or Helsingissä or any of the other 148614601 forms a Finnish noun can take…)

I have noticed that the same thing is often true of local newspapers’ websites. They’ll be called “The Gazette” or “The Tribune” or “The Local Paper.”

But they don’t tell you where they are!

Does this strike anyone else around here as weird?

Sheesh, talk about localization issues…

Machine translation and Open Source

Written by Patrick Hall, 2 months, 1 week ago.

Information Week blogger Serdar Yegulalp has some thoughts on the intersection of machine translation and open source:
Talk To Me, Openly - Open Source Blog - InformationWeek

He’s got an interesting anecdote about how he tackled studying Japanese, and it doubles as a nice intro to the idea behind bitext and statistical machine translation:

..Since I didn’t have money for classes, I homebrewed my own self-teaching method. I went out and bought a grammar guide, and then two copies of a given book — one in Japanese, the other an English translation — and sat with them side-by-side, comparing the two on a sentence-by-sentence and phrase-by-phrase level. It worked, up to a point, and while I’m no native speaker I can certainly figure out a fair amount of what’s put in front of me as long as I have a dictionary.

I didn’t know it at the time, but this parallel-texts technique is actually one of the best ways to also teach a computer to perform translations between languages.
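
To make the parallel-texts idea a bit more concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: the three-sentence English/French bitext, the function names `train` and `best_match`, and the choice of the Dice coefficient as the score. Real statistical MT systems use proper alignment models trained on far more data; this only shows the counting intuition.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned bitext (source, target).
bitext = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
]

def train(pairs):
    """Count, per sentence pair, which words occur and which co-occur."""
    source_counts, target_counts, pair_counts = Counter(), Counter(), Counter()
    for source, target in pairs:
        s_words, t_words = set(source.split()), set(target.split())
        source_counts.update(s_words)
        target_counts.update(t_words)
        pair_counts.update(product(s_words, t_words))
    return source_counts, target_counts, pair_counts

def best_match(word, source_counts, target_counts, pair_counts):
    """Pick the target word with the highest Dice score for `word`."""
    def dice(t):
        return 2 * pair_counts[(word, t)] / (source_counts[word] + target_counts[t])
    return max(target_counts, key=dice)

counts = train(bitext)
for word in ("cat", "dog", "the"):
    print(word, "->", best_match(word, *counts))
```

With only three sentence pairs, “cat” already lines up with “chat”, because they always appear together and never apart.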

He’s also got some thoughts on the licensing issues around the data used to build MT systems, a topic that I don’t think has gotten enough attention.

(Please consider this an open thread for your thoughts on how MT and FOSS can and should interact.)

Don’t sort stuff in Unicode with Bash?

Written by Patrick Hall, 2 months, 3 weeks ago.

Update: Okay, duh: I shouldn’t have called it “Bash”. What I meant was, “whatever the sort utility is in my default terminal.” Which, as Bryan points out in a comment below, has nothing to do with Bash: it’s GNU sort. More updates below.

I have a little text file with “Hello World” in lots of languages, which I often use for testing. I extracted a few lines with various scripts and saved that as helloworld.txt.

$ cat helloworld.txt
สวัสดีราคาถูก!  Thai
Habari dunia!   Kiswahili
Halló heimur!   Icelandic
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
Привет, мир!    Russian
ሠላም ዓለም!        Amharic
안녕, 세상!     Korean
Chào thế giới!  Vietnamese
Hallo, wrâld    Frisian
Hallo verden!   Norwegian/Bokmal
Laba ryta, pasauli!     Lithuanian

For my first amazing trick, I sort the file with the Bash shell built-in:

$ sort helloworld.txt
ሠላም ዓለም!        Amharic
Chào thế giới!  Vietnamese
Habari dunia!   Kiswahili
Halló heimur!   Icelandic
Hallo verden!   Norwegian/Bokmal
Hallo, wrâld    Frisian
안녕, 세상!     Korean
Laba ryta, pasauli!     Lithuanian
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
สวัสดีราคาถูก!  Thai
Привет, мир!    Russian

…which sucks. Because obviously Bash is ignoring anything fancy (Amharic, Korean, Thai) and sorting strictly by whatever ASCII shows up in the line. (Hard to say whether the «ó» in Icelandic is being considered, but shouldn’t it come after «o» anyway?)

I also installed and tried another terminal, rxvt-unicode, which supposedly has better Unicode support. I got the same results there as in Bash under gnome-terminal, which suggests to me that the problem is Bash, or somewhere deeper, and not the terminal itself.

So I tried the same sort in Python:

$ python
>>> lines = open('helloworld.txt').read().decode('utf-8').splitlines()
>>> for line in sorted(lines): print line
...
Chào thế giới!  Vietnamese
Habari dunia!   Kiswahili
Hallo verden!   Norwegian/Bokmal
Hallo, wrâld    Frisian
Halló heimur!   Icelandic
Laba ryta, pasauli!     Lithuanian
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
Привет, мир!    Russian
สวัสดีราคาถูก!  Thai
ሠላም ዓለም!        Amharic
안녕, 세상!     Korean

Python does better; clearly things are being sorted according to their Unicode code points. Which of course is a far cry from following UTS #10: Unicode Collation Algorithm, but that has to do with locales and all that.
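
Here is a small Python 3 sketch of the two kinds of ordering, using a few lines from the file above. The code-point sort is what plain `sorted()` does on strings. The locale-aware part uses `locale.strxfrm` and depends entirely on which locales are installed on the machine, so treat its result as an assumption to check and not a given. It is also not a full UTS #10 implementation.

```python
import locale

lines = [
    "Halló heimur!",
    "Hallo verden!",
    "Привет, мир!",
    "안녕, 세상!",
    "Chào thế giới!",
]

# Plain string comparison in Python 3 goes code point by code point.
by_code_point = sorted(lines)
for line in by_code_point:
    print("U+%04X  %s" % (ord(line[0]), line))

# Locale-aware ordering, if the environment's locale is usable.
try:
    locale.setlocale(locale.LC_COLLATE, "")
    by_locale = sorted(lines, key=locale.strxfrm)
except (locale.Error, OSError):
    by_locale = None  # no usable locale here; skip the comparison
print(by_locale)
```

Latin comes first, then Cyrillic, then Hangul, simply because that is where those blocks sit in Unicode.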

In any case, I won’t be trusting Bash to sort Unicode files any more.

(I’d be interested to know what the default sort does to the initial input in various other programming languages, comments welcome.)

Update:

After Bryan’s comment pointed out that I wasn’t even dealing with Bash, but rather GNU sort, I read through the manual and discovered the following trick in a footnote:

$ export LC_ALL=C; sort helloworld.txt
Chào thế giới!  Vietnamese
Habari dunia!   Kiswahili
Hallo verden!   Norwegian/Bokmal
Hallo, wrâld    Frisian
Halló heimur!   Icelandic
Laba ryta, pasauli!     Lithuanian
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
Привет, мир!    Russian
สวัสดีราคาถูก!  Thai
ሠላም ዓለም!        Amharic
안녕, 세상!     Korean

Which seems to be what I was looking for.
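
This output matches the Python one for a reason: with LC_ALL=C, sort compares raw bytes, and UTF-8 is designed so that byte order and code point order agree. A quick Python 3 check of that property, on a few of the lines above:

```python
lines = [
    "ሠላም ዓለም!",
    "สวัสดีราคาถูก!",
    "Halló heimur!",
    "Hallo, wrâld",
    "Привет, мир!",
    "안녕, 세상!",
]

by_code_point = sorted(lines)
# Sort by the UTF-8 bytes, roughly what a byte-wise sort sees.
by_utf8_bytes = sorted(lines, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_utf8_bytes)  # True
```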

“Translation is hugely useful!”

Written by Patrick Hall, 3 months, 1 week ago.

Here’s a post from Salon.com’s interesting How the World Works Globalization blog about an amazing translator:

Salon.com Technology | Fragments of the Tocharian

Translation is an under-appreciated art, but Ji Xianlin risked his life for his craft:

…he secretly translated the entire Indian epic, “The Ramayana,” from the original Sanskrit into Chinese, while experiencing the travails that afflicted nearly all Chinese intellectuals during the Cultural Revolution.

Ji’s observation about the utility of translation is trenchant:

It is translation that has preserved the perpetual youth of Chinese civilization. Translation is hugely useful!

We agree.

New Wikipedia Parser i18n

Written by Patrick Hall, 5 months, 3 weeks ago.

From the Wikipedia mailing list:

[Wikipedia-l] New parser in the works - please help

On the outside chance that any reader of this blog hasn’t edited Wikipedia, what they’re talking about here is the parser that interprets the markup that’s used to write Wikipedia articles. (You can see some by clicking “edit” on any Wikipedia article.) This stuff is called “Wikitext.”

An interesting detail of the project from our point of view is that attention is being paid to internationalizing the parser:

Some of what some people would think of as a “stupid parser trick” is
in fact important - e.g. L'''uomo'' which renders as L'<i>uomo</i>
(necessary for French and Italian).

So: we need to know what MediaWiki quirks are supporting important
constructs in languages other than English (which is the language the
list is in, and is the native language of most of the participants),
and particularly in non-European languages.

This list is unlikely to implement new features, e.g. (an example
brought up by GerardM) the double-apostrophe in Neapolitan. But we
really need to know about present important features that wouldn’t be
obvious to an English-speaker going through the present parser code.

It will be interesting to follow this stuff; I do a fair amount of trying to parse up Wikitext myself.

(PS. WordPress tends to mangle the apostrophes in that Italian example in the quote above; check the original link for the exact markup.)
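
To see why that example counts as a parser trick, here is a toy Python sketch of the apostrophe handling. It is not MediaWiki's code: the function name `render_quotes` and the rule it applies are simplifying assumptions, and the real parser's rules are more involved. The idea is that when the bold (''') and italic ('') markers on a line are both unbalanced, the first ''' is read as a literal apostrophe followed by an italic marker.

```python
import re

def render_quotes(line):
    """Toy renderer for '' (italic) and ''' (bold) runs on one line."""
    # Odd indices of `parts` hold apostrophe runs, even indices plain text.
    parts = re.split(r"('{2,})", line)
    runs = parts[1::2]
    bold_count = sum(1 for r in runs if len(r) == 3)
    italic_count = sum(1 for r in runs if len(r) == 2)

    # Simplified quirk: both marker kinds unbalanced, so treat the first '''
    # as a literal apostrophe plus an italic marker.
    if bold_count % 2 == 1 and italic_count % 2 == 1:
        for i in range(1, len(parts), 2):
            if len(parts[i]) == 3:
                parts[i - 1] += "'"
                parts[i] = "''"
                break

    out = []
    italic = bold = False
    for i, part in enumerate(parts):
        if i % 2 == 0 or len(part) > 3:
            out.append(part)  # plain text, or a longer run this toy ignores
        elif len(part) == 2:
            out.append("</i>" if italic else "<i>")
            italic = not italic
        else:
            out.append("</b>" if bold else "<b>")
            bold = not bold
    return "".join(out)

print(render_quotes("L'''uomo''"))
print(render_quotes("''ciao'' and '''mondo'''"))
```

An English-only test suite would probably never hit the first case, which is the point of the request.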
