Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

More on translation length: Word lengths in many languages

Written by Patrick Hall, 1 year ago.

The previous post spawned an interesting comment thread, thanks to everyone for the input and ideas!

Serendipitously enough, Richard Ishida at the W3C recently published “Text size in translation”, which has some numbers relevant to our discussion about translation length.

He cites some data from IBM which suggests that if you translate from English into a “European” language (whatever that means), your text gets longer in general. According to these numbers, if you start with a text of 10 characters, you’ll probably end up with a translation of about 25 characters. If you start with a text of 70 characters, you’ll probably end up with a translation of about 105 characters.

Richard also cites some research he himself did on localization in Flickr. He found that a typical interface term such as views could end up three times as long in Italian (visualizzazioni: 15 letters against 5, or 300% of the original length). Korean (조회), by way of comparison, comes out shorter: just 2 letters (or better, perhaps, “syllabic glyphs”).
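To make the arithmetic behind those figures explicit, here is a minimal sketch in Python. The function name expansion_ratio is made up for illustration, and the only inputs are the numbers already quoted above; nothing here comes from Richard's article beyond those.

```python
def expansion_ratio(source_length, target_length):
    """Length of the translation divided by the length of the source."""
    return target_length / source_length

# IBM figures quoted above: 10 -> about 25 characters, 70 -> about 105.
ibm_short = expansion_ratio(10, 25)
ibm_long = expansion_ratio(70, 105)

# Flickr example: "views" against its Italian and Korean counterparts.
# The Korean word is written with escapes so it stays precomposed
# (two code points) however this file is saved.
korean_views = "\uc870\ud68c"
views_italian = expansion_ratio(len("views"), len("visualizzazioni"))
views_korean = expansion_ratio(len("views"), len(korean_views))

print(ibm_short, ibm_long, views_italian, views_korean)
```

Note that the short string grows proportionally more than the long one (2.5 times against 1.5 times), which is the pattern the IBM numbers suggest.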

Following up on his idea, I decided to look at average word length across a wide variety of languages. You can see the results here:


Languages by Average word length (click to see table)

What’s the bottom line? (Keeping in mind that the tail end of the list is screwy because of languages that don’t delimit words, because the definition of “word” is fuzzy, etc. etc.)

Average “word” length varies from something like 3 characters (Dangme) to somewhere around 15 (Inuktitut).

So, quite apart from the issues of translator skill, it seems undeniable that if you translate a text from Inuktitut to Dangme, it will come out shorter. (And that’s a HUGE market right there ;) )

Thoughts about the chart are welcome… I gotta get back to work!

If you’re feeling a little masochistic you can also take a look at the 20 minutes’ worth of grungy code I used to build that table. (There are some dependencies mentioned inline; if you have trouble running it let me know & I’ll try to help you out/clean it up): udhr_word_lengths.py
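For readers who just want the gist, the core measurement can be sketched in a few lines of standard-library Python. This is not the contents of udhr_word_lengths.py; the function name and the regex-based definition of a “word” (a run of letters) are assumptions made for illustration, and the sample sentence is the opening of Article 1 of the UDHR in English.

```python
import re

# Assumption: a "word" is a maximal run of letters. The pattern takes
# everything \w matches and removes digits and the underscore.
# Combining marks are not letters, so they split words, and this
# definition breaks down for scripts that do not delimit words at all.
WORD_PATTERN = re.compile(r"[^\W\d_]+")

def average_word_length(text):
    """Mean number of code points per word, or 0.0 if there are no words."""
    words = WORD_PATTERN.findall(text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

sample = "All human beings are born free and equal in dignity and rights."
print(average_word_length(sample))  # 51 letters over 12 words
```

Running the same function over a translation of the same text in each language would give a rough column of numbers like the one in the table, with all the caveats about what counts as a word.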

3 Comments for 'More on translation length: Word lengths in many languages'

  1. Comment received 1 year ago from dda

    조회 is actually 5 letters: ㅈㅗㅎㅗㅣ grouped into two syllables.
    You have to play a bit with normalization to get the number of letters in Korean words.
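    One way that normalization might look in Python, as a sketch rather than dda's own method: NFD splits each syllable into its conjoining jamo, which gives 4 code points for this word, because the vowel ㅚ is stored as a single jamo. Getting to 5 letters needs an extra step that splits compound vowels. The COMPOUND_VOWELS table and the function names below are assumptions made for illustration; double consonants and final consonant clusters are left alone.

```python
import unicodedata

# Assumed mapping from compound vowel jamo to their component vowels.
COMPOUND_VOWELS = {
    "\u116a": "\u1169\u1161",  # wa  = o + a
    "\u116b": "\u1169\u1162",  # wae = o + ae
    "\u116c": "\u1169\u1175",  # oe  = o + i
    "\u116f": "\u116e\u1165",  # weo = u + eo
    "\u1170": "\u116e\u1166",  # we  = u + e
    "\u1171": "\u116e\u1175",  # wi  = u + i
    "\u1174": "\u1173\u1175",  # ui  = eu + i
}

def jamo(word):
    """Decompose Hangul syllables into conjoining jamo using NFD."""
    return unicodedata.normalize("NFD", word)

def letters(word):
    """Like jamo(), but also split compound vowels into simple ones."""
    return "".join(COMPOUND_VOWELS.get(ch, ch) for ch in jamo(word))

word = "\uc870\ud68c"  # the Korean word for "views" discussed above
print(len(word), len(jamo(word)), len(letters(word)))
```

    So the same word counts as 2, 4, or 5 units depending on what is treated as a letter.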

  2. Comment received 1 year ago from Richard Ishida

    I think you need to be clear about whether you are comparing size of strings (e.g. number of characters needed in database fields) or visual widths (and heights when multiple lines are involved). For visual width, once you move outside of the languages that use simple characters you need to count length in something like nominal widths or overall word length (and height), rather than characters. This is because (a) there are combining characters, shape changes, ligatures, etc. that collapse width, and (b) some characters are much bigger than others in scripts like Arabic, and complicated characters such as those in Chinese are systematically wider than Latin characters.
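    The string-size side of that distinction can be illustrated with a short Python sketch. The function name string_sizes and the three measures it reports are assumptions chosen for illustration; none of them is a visual width, which would need a rendering engine and a font.

```python
import unicodedata

def string_sizes(text):
    """Three storage-oriented ways of counting the 'length' of a string."""
    return {
        # Number of Unicode code points, which is what len() reports.
        "code_points": len(text),
        # Bytes needed to store the string as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # Code points that are not combining marks, a crude stand-in
        # for the number of separately drawn characters.
        "non_combining": sum(
            1 for ch in text if unicodedata.combining(ch) == 0
        ),
    }

print(string_sizes("visualizzazioni"))
print(string_sizes("\uc870\ud68c"))  # the Korean word for "views"
print(string_sizes("e\u0301"))       # e followed by a combining acute accent
```

    The last example shows the collapse described in point (a): two code points and three bytes for what is drawn as a single accented letter.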

  3. Comment received 1 year ago from Patrick Hall

    Hi Richard,

    An honor :D

    You raise an excellent point; in a way I was really comparing apples and oranges, I suppose. Translators in the “people who translate prose” sense tend to think strictly in terms of the number of characters (or multiples such as words), whereas for interface designers and other folks dealing with l10n the sorts of issues you describe can be at least as important as the words themselves.

    As far as my own programming skills go (I’m pretty much an exclusively scripting language kind of guy), the only way to actually access the width of individual characters is to ask the DOM. (I’m sure it’s possible to get such data from closer-to-the-metal languages, but I never use them.)

    I found this article inspiring on that topic:

    lalit dot lab: JavaScript/CSS Font Detector

    It would be interesting

    As a bit of an aside, it seems to me that it would be possible to test font support for particular (human) languages by using Javascript to render and measure width variation of a sample text — if the character widths are detected to vary sufficiently, then it’s probable that the user has a font that supports the language in question.
