Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

“Machine Translation” is a Misnomer

Written by Patrick Hall, 4 hours, 21 minutes ago.

An ickle rant ensues.

The public doesn’t understand how machine translation works. And generally speaking, the public doesn’t understand that machine translation couldn’t exist without human translators in the loop.

In other words, it’s not really “machines” that are “doing” the translating, it’s people. The machines are simply programmed to imitate the translations the people do.

I think this is symptomatic of a widespread disease in the computer world: an obsession with cutting people out of the loop.

It’s the same tunnel vision that motivated the much (and rightly) criticized footer that graced Google News at its launch:

This page was generated entirely by computer algorithms without human editors.

No humans were harmed or even used in the creation of this page.

O’Reilly Network — Google Needs People

Admittedly, this quote has long since been removed, and it was probably only meant as a joke in the first place anyway. But this problematic attitude still underlies a lot of reactions to Machine Translation. In general, people don’t realize that it wouldn’t be possible without human translators.

Yes, the title is pretty much a joke. I do visit reality once in a while.

Aligning translations with text compression?

Written by Patrick Hall, 2 days, 7 hours ago.

Dear interwebs series of tubes people:

Random thought:

If you have two translations of the same text, and you perform some sort of compression on both of them, could interesting relationships between the content of the two translations be uncovered? For instance, it seems like you might be able to get rid of non-content words, which might make it conceivable to align the texts at a phrase level.

I’ve dug a bit, but only found a paper by Conley and Klein, “Using Alignment for Text Compression.” But a quick glance at that (haven’t read it yet) suggests that they’re interested in improving compression for compression’s sake, which isn’t what I have in mind.

Thanks for your thoughts and observations, interwebs.

Unicode headed toward World Domination™

Written by Patrick Hall, 1 week, 3 days ago.

The Google Blog has a chart showing that there is a very clear trend toward Unicode adoption.

Apparently their numbers refer to UTF-8 alone (as opposed to UTF-16/UCS-2 or (haha) UTF-32/UCS-4), which again is good news. (Though one wonders if there is any uptake of UTF-16 on the web… I hope not.)
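For a rough feel of how the three encodings differ in size, here is a minimal sketch that counts the bytes each one needs for a few strings. The sample strings and the helper name `byte_lengths` are made up for illustration, and the little-endian codecs are used only so that no byte-order mark gets counted. It says nothing about the numbers in the chart.

```python
# Sample strings are arbitrary; the "-le" codecs are used so that no
# byte-order mark is included in the counts.
SAMPLES = ["hello", "héllo", "日本語"]
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]


def byte_lengths(text):
    """Return {encoding name: number of bytes} for one string."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample, byte_lengths(sample))
```

Plain ASCII text takes one byte per character in UTF-8, two in UTF-16 and four in UTF-32, while CJK text like the last sample is actually smaller in UTF-16 than in UTF-8.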

The data is “Google internal”… peer-reviewed, it ain’t.

Thanks to Won for the pointer!

Of the Media, Scientists, Word Lengths, and Colossal Squids

Written by Patrick Hall, 2 weeks ago.

I’m a big fan of Pharyngula, but… he was kinna wrong about a nitpicky little detail. And this particular nitpicky little detail was about language, so, the truth must out!

In his post “As big as dinner plates?”, Dr. Myers compares two articles about the recent dissection of a colossal squid. (Can we please pause to acknowledge that that thing is HUGE? Kthx.)

Read the USA Today article on the colossal squid eye, which boils down to basically, “Oooh, they’re big!”. Then compare it to the blog entry on the colossal squid eye, written by a scientist. The latter is much more informative, and contains more specific details, and isn’t afraid to challenge the reader with words longer than a single syllable.

Emphasis added, to the bit that’s linguistic. Now obviously, that’s an exaggeration; the AP article (not USA Today, in fact) doesn’t really consist of one-syllable words (though one can say a lot with one-syllable words…).

But the idea is clear enough: USA Today uses shorter words than scientists, because journalists dumb down science.

Right?

Heck, I dunno. So, I answered the question the way I usually do: I wrote a program.

Surprise!

Average word lengths
Blog: 4.45
AP: 4.64

Longest words
Blog: photoreceptors, tremendously, considerably, architeuthis, neighbouring, cephalopods, cephalopods, disappeared
AP: mesonychoteuthis, communications, international, invertebrates, redistributed, formaldehyde, centimeters, spectacular

Average word length is about the same; if anything, the AP article’s is a touch higher (4.64 vs. 4.45). That says nothing whatsoever about the quality of information in the articles; it does, however, say that they use words that are about the same size.

One can argue about the character of the long words in each text (you could use frequency counts of them, too), but still, the AP article uses the word “Mesonychoteuthis.” That’s a long way from a one-syllable word.

Here’s the code.
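
For anyone who just wants to try this kind of count on their own texts, a minimal sketch follows. It is not the original program: the tokenizing regex (a word is a run of ASCII letters), the function names, the lowercasing, the removal of duplicate words and the default of eight longest words are all assumptions made for illustration, so its numbers may not match the ones above.

```python
import re

# Assumption: a "word" is any run of ASCII letters; digits, hyphens and
# apostrophes split words. A different rule will shift the averages.
WORD_RE = re.compile(r"[A-Za-z]+")


def words(text):
    """Return the lowercased words found in text, in order."""
    return [w.lower() for w in WORD_RE.findall(text)]


def average_word_length(text):
    """Mean number of letters per word, or 0.0 for text with no words."""
    ws = words(text)
    return sum(len(w) for w in ws) / len(ws) if ws else 0.0


def longest_words(text, n=8):
    """The n longest distinct words, longest first, ties broken alphabetically."""
    unique = set(words(text))
    return sorted(unique, key=lambda w: (-len(w), w))[:n]


if __name__ == "__main__":
    sample = "The colossal squid has tremendously large photoreceptors"
    print(round(average_word_length(sample), 2))
    print(longest_words(sample, 3))
```

To compare two articles, save each as plain text, read the files in, and call both functions on each.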