Counting Words in the Age of Unicode
It’s pretty easy to get a reasonable word count in a language where words are separated by spaces, using Python in this case:
(Note: I’m using the Universal Declaration of Human Rights as sample text; you can too: http://www.unicode.org/udhr/ )
>>> len(open('udhr_eng.txt').read().split())
1778
It’s a bit trickier if the text is in a language that uses something besides a space to delimit words, like Amharic. But it’s still not hard:
>>> len(open('udhr_amh.txt').read().decode('utf-8'))
6603
>>> len(open('udhr_amh.txt').read().decode('utf-8').split(u'፡'))
974
(Splitting on U+1361 ETHIOPIC WORDSPACE there, which you probably can’t see unless you have an Ethiopic font.)
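The sessions above are Python 2 (hence the explicit .decode('utf-8')). Here is a rough Python 3 sketch of the same delimiter-based counting. The names count_words and count_words_in_file are made up for illustration. It also skips empty pieces, which the one-liners above do not, so it may report slightly different numbers:

```python
# Hypothetical helpers, not from the original sessions.
ETHIOPIC_WORDSPACE = "\u1361"  # the word separator split on above


def count_words(text, delimiter=None):
    """Count non-empty pieces of text split on delimiter.

    With delimiter=None, str.split() splits on runs of whitespace.
    """
    pieces = text.split(delimiter)
    # Assumption: a piece that is empty or only whitespace is not a word.
    return sum(1 for piece in pieces if piece.strip())


def count_words_in_file(path, delimiter=None):
    """Read a UTF-8 text file and count its words."""
    with open(path, encoding="utf-8") as f:
        return count_words(f.read(), delimiter)
```

Usage would look like count_words_in_file('udhr_eng.txt') for English and count_words_in_file('udhr_amh.txt', ETHIOPIC_WORDSPACE) for Amharic.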
But, for languages like Japanese, things get tricky:
>>> len(open('udhr_jpn.txt').read().decode('utf-8').split())
123
That is obviously wrong, because Japanese doesn’t delimit words with… well, anything really.
The same sort of challenge holds for Chinese, and also for Southeast Asian languages like Thai and Khmer (Cambodian), which use spaces to delimit phrases, not words.
So what’s a translator to do?
Here’s my opinion:
1. Count letters (codepoints, to be precise)
2. Come up with an average letters/word number for each language, e.g., “Thai text has n letters per word on average” (with a real value for n, which would require a bit of research)
3. Calculate (1)/(2), and charge based on that number.
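The three steps above can be sketched in Python 3 roughly like this. The function names are invented for illustration, and the letters_per_word argument is a placeholder you would have to research per language. I'm also assuming whitespace shouldn't count as a "letter", which is a judgment call (punctuation is still counted here):

```python
def count_letters(text):
    """Step 1: count codepoints, skipping whitespace (an assumption)."""
    return sum(1 for ch in text if not ch.isspace())


def estimate_word_count(text, letters_per_word):
    """Step 3: divide the letter count by the per-language average.

    letters_per_word is the researched average from step 2; no real
    values are provided here.
    """
    if letters_per_word <= 0:
        raise ValueError("letters_per_word must be positive")
    return round(count_letters(text) / letters_per_word)
```

For example, estimate_word_count(text, n) with a made-up n of 2.5 turns a 10-letter string into an estimate of 4 words.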
One question immediately springs to mind: how much variation is there in the average letters/word number? Is the ratio too variable to be useful?
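One way to start poking at that question is to measure the ratio on several sample texts and see how much it moves around. This is only a sketch: it works for texts where words can already be counted (here, by splitting on whitespace), and the function names are hypothetical:

```python
from statistics import mean, pstdev


def letters_per_word(text):
    """Average number of codepoints per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return sum(len(word) for word in words) / len(words)


def ratio_spread(samples):
    """Return (mean ratio, coefficient of variation) across sample texts.

    A small coefficient of variation would suggest the ratio is stable
    enough to base an estimate on.
    """
    ratios = [letters_per_word(sample) for sample in samples]
    average = mean(ratios)
    return average, pstdev(ratios) / average
```

For languages without word delimiters, you would need hand-segmented reference texts to get the word counts in the first place.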