Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now


Counting Words in the Age of Unicode

Written by Patrick Hall, 6 months, 2 weeks ago.

It’s pretty easy to get a reasonable word count in a language where words are separated by spaces, using Python in this case:

(Note: I’m using the Universal Declaration of Human Rights as sample texts, and you can too: http://www.unicode.org/udhr/ )


>>> len(open('udhr_eng.txt').read().split())
1778

It’s a bit trickier if the text is in a language that uses something besides a space to delimit words, like Amharic. But it’s still not hard:


>>> len(open('udhr_amh.txt').read().decode('utf-8'))
6603
>>> len(open('udhr_amh.txt').read().decode('utf-8').split(u'፡'))
974

(Splitting on U+1361 ETHIOPIC WORDSPACE there, which you probably can’t see unless you have an Ethiopic font.)
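The sessions above are Python 2 (hence the `.decode('utf-8')`). Here is a rough Python 3 sketch of the same delimiter-based count. The function name `count_words` and the constant are made up for illustration, and dropping empty pieces is my own choice, so the numbers may differ slightly from the ones shown above.

```python
ETHIOPIC_WORDSPACE = "\u1361"  # U+1361 ETHIOPIC WORDSPACE

def count_words(text, delimiter=None):
    """Count the non-empty pieces of text split on delimiter.

    delimiter=None means "split on any run of whitespace",
    which is what a bare str.split() does.
    """
    pieces = text.split(delimiter)
    # Drop empty or whitespace-only pieces, which show up when the
    # delimiter is doubled or sits at the very start or end of the text.
    return sum(1 for piece in pieces if piece.strip())

# Reading one of the sample files in Python 3 (the file must exist locally):
# with open("udhr_amh.txt", encoding="utf-8") as f:
#     print(count_words(f.read(), ETHIOPIC_WORDSPACE))
```

In Python 3, `open(..., encoding="utf-8")` hands back decoded text, so there is no separate decode step.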

But, for languages like Japanese, things get tricky:


>>> len(open('udhr_jpn.txt').read().decode('utf-8').split())
123

Which is obviously wrong.

Because Japanese doesn’t delimit words with… well, anything really. The same sort of challenge holds for Chinese, and also for Southeast Asian languages like Thai and Khmer (Cambodian), which use spaces to delimit phrases, not words.

So what’s a translator to do?

Here’s my opinion:

1. Count letters (codepoints, to be precise)
2. Come up with an average letters/word number for each language, e.g., “Thai text has n letters per word on average” (with a real value for n, which would require a bit of research)
3. Calculate (1)/(2), and charge based on that number.
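The three steps above could be sketched like this in Python 3. Everything here is an assumption on my part: the function names are invented, I chose to skip whitespace when counting "letters", and the average you pass in is whatever number your own research turns up.

```python
def count_letters(text):
    """Step 1: count codepoints, skipping whitespace.

    Skipping whitespace is an assumption; the recipe only says "letters".
    Punctuation is still counted.
    """
    return sum(1 for ch in text if not ch.isspace())

def estimate_words(text, letters_per_word):
    """Step 3: divide the letter count by the per-language average (step 2)."""
    if letters_per_word <= 0:
        raise ValueError("letters_per_word must be positive")
    return count_letters(text) / letters_per_word

# Usage with a made-up average of 4.0 letters per word:
# estimate_words("abcd efgh", 4.0) gives 2.0
```

The result is a float, so rounding (and whether to round up) is a billing decision rather than a coding one.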

One question immediately springs to mind: how much variation is there in the average letters/word number? Is the ratio too variable to be useful?
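One rough way to poke at that question, at least for a language that does use spaces: chop a text into chunks of words, compute letters/word for each chunk, and look at the spread. This is only a sketch; the function name and the chunk size are arbitrary choices of mine, and punctuation stuck to a word gets counted as letters. For Thai or Japanese you would need text that someone has already segmented into words.

```python
import statistics

def letters_per_word_stats(text, chunk_size=100):
    """Return (mean, standard deviation) of letters/word across chunks.

    Words are whatever str.split() produces, so this only makes sense
    for space-delimited text.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    ratios = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        # Average word length (in codepoints) within this chunk.
        ratios.append(sum(len(word) for word in chunk) / len(chunk))
    # Population standard deviation, so a single chunk gives 0.0.
    return statistics.mean(ratios), statistics.pstdev(ratios)
```

A small standard deviation relative to the mean would suggest the ratio is stable enough to charge by.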

Translating for an Audience

Written by Patrick Hall, 6 months, 3 weeks ago.

Sometimes understanding who you’re translating for is as important as understanding what you’re translating.

Consider:

Brochures, fliers and the Red Cross Web site have been translated into a handful of languages — instructions on how to give first aid, survive an earthquake and create a family emergency plan are all available in Spanish, Russian, Korean, Arabic and Vietnamese.

But the problem is with more than just information, says Mar Tobiasom, of the Snohomish County Red Cross. Some residents don’t know what a smoke alarm is, and for them, being reminded to change the battery isn’t helpful.

[Van Dinh-Kuno, executive director of the Refugee and Immigrant Forum of Snohomish County], agrees.

“Some of the people from the ESL [English as a second language] populations, they don’t even read and write in their own language,” she says, adding that agencies trying to reach non-English speakers should stick with short, direct messages.

Snohomish County News: Emergency teams focus on language gulf

The Emergency team’s director also makes this observation about the social aspects of translation:

“The foundation is, first, ID’ing those who can speak the languages, two is who is connected in the community, and three is how can we pull those pieces together and reach out to the community.”

It’s interesting to try to picture how these sorts of restrictions port to the web, and specifically, to localization contexts.

(I’ll leave that as an exercise to the reader.)

Is Translation Accessibility?

Written by Patrick Hall, 6 months, 3 weeks ago.

Making your site more accessible will help you get your content through to some fraction of all the folks who have trouble cutting through the UI to the content.

Translating your site into Chinese will help you get your content through to some fraction of one billion people.

This is purely rhetorical; we try as hard as we can to use Javascript and CSS unobtrusively, and all that. Heck, it might even be breaking the law to make a website inaccessible in this sense.

But doesn’t translating also make a site more accessible?

I’m Just Kidding But…

Written by Patrick Hall, 6 months, 3 weeks ago.

Eww.

A Glimpse at Internationalized Domain Names (IDNs)

Written by Patrick Hall, 7 months ago.

You may have heard that work is being done to standardize URLs in non-Latin scripts.

Here’s a nice overview from the Washington Post:

Other alphabets take on English domain.

One thing that I didn’t realize was that not only domain names, but also top-level domains are going to be localized: there will be translations of .com, .org, .net and so on in all the various scripts in Unicode: Cyrillic, Chinese, Ethiopic, etc.

Still learning about all this, myself. More links here:

Dictionaries for Minimalists

Written by Patrick Hall, 7 months, 2 weeks ago.

And now a dictionary format I like to parse.

Because lazy can be good.

I like dictionary files like this:

house
casa

book
livro

Entries are separated by \n\n, and terms are separated by \n.
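To show how little code that takes, here is a minimal Python 3 sketch of a parser for it. The function name `parse_dictionary` is mine, and normalizing Windows line endings and skipping blank pieces are extra precautions, not part of the format as described.

```python
def parse_dictionary(text):
    """Parse blank-line-separated entries into lists of terms."""
    entries = []
    # Normalize Windows line endings so "\n\n" reliably separates entries.
    for block in text.replace("\r\n", "\n").split("\n\n"):
        # One term per line; ignore stray empty lines.
        terms = [line.strip() for line in block.split("\n") if line.strip()]
        if terms:
            entries.append(terms)
    return entries

sample = "house\ncasa\n\nbook\nlivro\n"
print(parse_dictionary(sample))
# [['house', 'casa'], ['book', 'livro']]
```

Each entry comes back as a list, so an entry with three or more terms (say, a third language) works without any changes.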

I think it’s better than this, which for some reason is quite popular in lexicons on the web:

house - casa
book - livro

Or this:

house casa
book livro

(Those are supposed to be tabs, but may well not be by the time they got to you… which is why I don’t like tabs.)

Arguably, you might want to add language tag labels to my favorite format, because it’s a bit safer:

EN house
PT casa

EN book
PT livro

…because sometimes people would accidentally put “casa” preceding “house,” instead of after it, which is the convention this file is apparently following.
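With the tags in place, a parser no longer has to care about line order. A sketch in Python 3, assuming each line is a tag, a single space, then the term; the function name `parse_tagged_dictionary` is hypothetical.

```python
def parse_tagged_dictionary(text):
    """Parse tagged entries into dicts like {"EN": "house", "PT": "casa"}."""
    entries = []
    for block in text.split("\n\n"):
        entry = {}
        for line in block.split("\n"):
            line = line.strip()
            if not line:
                continue
            # Split on the first space only, so multi-word terms survive.
            tag, _, term = line.partition(" ")
            entry[tag] = term.strip()
        if entry:
            entries.append(entry)
    return entries

# The first entry is deliberately "backwards" to show that order doesn't matter.
sample = "PT casa\nEN house\n\nEN book\nPT livro"
for entry in parse_tagged_dictionary(sample):
    print(entry["EN"], "->", entry["PT"])
```

The cost is that somebody has to type the tags, which is the trade-off against the plain version.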

But, you run that risk in any format. You run that risk in XML, as a matter of fact. So you have to deal with it anyway. Which brings me back to my favorite format, which is easy to type, easy to parse, easy to store. It’s easy.

house
casa

book
livro