hacklog

Counting Words in the Age of Unicode

Written by Patrick Hall, October 31st, 2007

It’s pretty easy to get a reasonable word count in a language where words are separated by spaces, using Python in this case:

(Note: I’m using the Universal Declaration of Human Rights as sample texts; you can too: http://www.unicode.org/udhr/ )


>>> len(open('udhr_eng.txt').read().split())
1778

It’s a bit trickier if the text is in a language that uses something besides a space to delimit words, like Amharic. But it’s still not hard:


>>> len(open('udhr_amh.txt').read().decode('utf-8'))
6603
>>> len(open('udhr_amh.txt').read().decode('utf-8').split(u'፡'))
974

(Splitting on U+1361 ETHIOPIC WORDSPACE there, which you probably can’t see unless you have an Ethiopic font.)
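For anyone on Python 3, here is a rough sketch of the same idea wrapped in a function. The function name, the choice to split on ordinary whitespace as well, and the filtering of empty strings left behind by trailing delimiters are all assumptions made for illustration, so the number it gives may differ a little from the count above.

```python
import re

ETHIOPIC_WORDSPACE = "\u1361"  # the '፡' character


def count_delimited_words(text, extra_delimiters=ETHIOPIC_WORDSPACE):
    """Count tokens separated by whitespace or any of the extra delimiters.

    Hypothetical helper: empty strings produced by leading or trailing
    delimiters are dropped rather than counted as words.
    """
    pattern = r"[\s" + re.escape(extra_delimiters) + "]+"
    return len([token for token in re.split(pattern, text) if token])


# Small inline samples, not the UDHR files.
print(count_delimited_words("ሰላም፡ዓለም፡"))
print(count_delimited_words("hello world\n"))
```

To run it on a whole file, read the text with open('udhr_amh.txt', encoding='utf-8').read() and pass it in.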

But, for languages like Japanese, things get tricky:


>>> len(open('udhr_jpn.txt').read().decode('utf-8').split())
123

Which is obviously wrong, because Japanese doesn’t delimit words with… well, anything really. The same sort of challenge holds for Chinese, and also for Southeast Asian languages like Thai and Khmer (Cambodian), which use spaces to delimit phrases, not words.
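A tiny Python 3 illustration of the problem, using a made-up two-sentence sample (not the UDHR text): splitting on whitespace only finds the gap between the sentences, so each "word" it returns is really a whole sentence.

```python
def whitespace_chunks(text):
    """Split on whitespace, the way the naive word count does."""
    return text.split()


# Two short Japanese sentences separated by a single space (invented sample).
sample = "私は学生です。 彼は先生です。"

for chunk in whitespace_chunks(sample):
    # Each chunk is a full sentence containing several words.
    print(len(chunk), chunk)
```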

So what’s a translator to do?

Here’s my opinion:

1. Count letters (codepoints, to be precise)
2. Come up with an average letters/word number for each language, e.g., “Thai text has n letters per word on average” (with a real value for n, which would require a bit of research)
3. Calculate (1)/(2), and charge based on that number.
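The three steps above could be sketched in Python 3 roughly like this. The function names are hypothetical, "letters" is taken to mean every codepoint that isn’t whitespace (an assumption; you might also want to drop punctuation), and the letters-per-word value has to be supplied by the caller because no researched figure exists here.

```python
def count_letters(text):
    """Step 1: count codepoints, ignoring whitespace (an assumption)."""
    return sum(1 for ch in text if not ch.isspace())


def estimate_word_count(text, letters_per_word):
    """Step 3: divide the letter count by the per-language average.

    letters_per_word is step 2's number; it must come from real research
    on the language in question, so it is a required argument here.
    """
    if letters_per_word <= 0:
        raise ValueError("letters_per_word must be positive")
    return count_letters(text) / letters_per_word


# Sanity check on English, where the real answer is easy to see.
print(estimate_word_count("abcd efgh", 4.0))
```

For a Japanese file you would call something like estimate_word_count(text, letters_per_word=n), with n filled in once somebody has actually measured it.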

One question immediately springs to mind: how much variation is there in the average letters/word number? Is the ratio too variable to be useful?
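One way to start poking at that question, at least for text where the words can already be identified (a space-delimited language, or a sample somebody has segmented by hand), is to compute the ratio per paragraph and look at the spread. This is only a sketch in Python 3: the function names are made up, and punctuation attached to a word is counted as part of it.

```python
import statistics


def letters_per_word(text):
    """Average number of codepoints per whitespace-separated word."""
    words = text.split()
    if not words:
        return None  # nothing to measure in an empty paragraph
    return sum(len(word) for word in words) / len(words)


def ratio_spread(paragraphs):
    """Return (mean, population standard deviation) of the per-paragraph ratios."""
    ratios = [r for r in map(letters_per_word, paragraphs) if r is not None]
    return statistics.mean(ratios), statistics.pstdev(ratios)


# Toy input: one paragraph of 2-letter words, one of 4-letter words.
print(ratio_spread(["ab cd", "abcd efgh"]))
```

A standard deviation that is small relative to the mean would suggest the ratio is stable enough to charge by; a large one would suggest it isn’t.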

Translating for an Audience

Written by Patrick Hall, October 26th, 2007

Is Translation Accessibility?

Written by Patrick Hall, October 26th, 2007

I’m Just Kidding But…

Written by Patrick Hall, October 22nd, 2007

A Glimpse at Internationalized Domain Names (IDNs)

Written by Patrick Hall, October 17th, 2007

Dictionaries for Minimalists

Written by Patrick Hall, October 4th, 2007