Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

-ia -ie… doh! A Tale of a Wayward Wildcard

Written by Patrick Hall, 1 year, 3 months ago.

Here’s a little tale about the dangers of counting words.

Ooh, danger.

We’ve been working on a lexicon tool for Blogamundo, which has wildcard support. So you can do things like search for:

ro/en/Revolu*

…which means “find me all the English translations for Romanian words that start with Revolu.” It’s fun to play with, and useful when translating.
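
Here is a minimal sketch of how a query like that could be resolved, assuming the lexicon is just a list of translation pairs per language direction. This is not Blogamundo's actual code: the names `LEXICON` and `lookup` and the storage layout are made up for illustration, and the data is the handful of pairs shown in this post.

```python
from fnmatch import fnmatchcase

# Toy data: (Romanian, English) pairs copied from the results in this post.
RO_EN = [
    ("Revoluţia franceză", "French Revolution"),
    ("Revoluţia din Februarie", "February Revolution"),
    ("Revoluţia engleză", "English Civil War"),
    ("Revoluţie", "Revolution"),
    ("Revoluţia industrială", "Industrial Revolution"),
    ("Revoluţia din Octombrie", "October Revolution"),
    ("Revoluţia din Neolitic", "Neolithic Revolution"),
]

# Hypothetical layout: one list of (source, target) pairs per direction.
LEXICON = {
    ("ro", "en"): RO_EN,
    ("en", "ro"): [(en, ro) for ro, en in RO_EN],
}


def lookup(query):
    """Resolve a query such as 'ro/en/Revolu*' to matching (source, target) pairs."""
    source_lang, target_lang, pattern = query.split("/", 2)
    pairs = LEXICON[(source_lang, target_lang)]
    # fnmatchcase is case-sensitive and treats '*' as "any run of characters".
    return [(src, tgt) for src, tgt in pairs if fnmatchcase(src, pattern)]


for src, tgt in lookup("ro/en/Revolu*"):
    print(src, "»", tgt)
```

Without a wildcard the pattern has to match the whole entry, so `lookup("en/ro/Revolution")` returns only the single pair whose English side is exactly "Revolution".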

But here’s the interesting thing. If you actually run that query on the lexicon we’ve bootstrapped from Wikipedia, you get results like this:

Revoluţia franceză » French Revolution
Revoluţia din Februarie » February Revolution
Revoluţia engleză » English Civil War
Revoluţie » Revolution
Revoluţia industrială » Industrial Revolution
Revoluţia din Octombrie » October Revolution
Revoluţia din Neolitic » Neolithic Revolution

And then a bazillion more revolutions.

So as I sat staring at that pattern of words, as I am wont to do, I thought “Hmm, maybe if we counted up the word frequencies of every word in that whole result list, the actual pair Revolution » {whatever the Romanian word for revolution is} would bubble up to the top.” It seemed at a glance that the translation was obvious: Revoluţia. But if we could use frequency alone to detect that pair automatically, perhaps it would be possible to run the same trick with other search result sets, and thusly improve the lexicon.

So I tested my little theory by simply splitting the result set into words and counting them all up:

    21 Revoluţia
    15 Revolution
    7 of
    7 din
    4 revolution
    4 de
    4 1848
    3 la
    3 Revolutionary
    2 Socialist

As you can see, Revolution and Revoluţia are by far the most common. This strongly suggested, I imagined, that Revolution and Revoluţia were in fact translations.
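
The split-and-count step can be sketched in a few lines of Python. This is an illustration rather than the script I actually ran: `RESULTS` holds only the seven result lines quoted in this post (not the full result set), and `count_words` is a name chosen for the example. Like the tally shown here, it is case-sensitive, so "Revolution" and "revolution" would be counted separately.

```python
from collections import Counter

# Only the seven result lines quoted in the post, not the full result set.
RESULTS = """\
Revoluţia franceză » French Revolution
Revoluţia din Februarie » February Revolution
Revoluţia engleză » English Civil War
Revoluţie » Revolution
Revoluţia industrială » Industrial Revolution
Revoluţia din Octombrie » October Revolution
Revoluţia din Neolitic » Neolithic Revolution
"""


def count_words(results):
    """Count every whitespace-separated word on both sides of each result line."""
    counts = Counter()
    for line in results.splitlines():
        # Drop the » separator so that only real words are counted.
        counts.update(line.replace("»", " ").split())
    return counts


counts = count_words(RESULTS)
for word, n in counts.most_common(3):
    print(n, word)
```

Even on this small sample the same pattern shows up: Revoluţia and Revolution each appear six times, while Revoluţie appears once.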

Except they aren’t.

Not exactly, anyway: as clever people have probably already noticed, I had actually missed the answer, sitting in the results of the initial query, precisely because I had used a wildcard search. There it was, plain as day:

Revolution » Revoluţie

…with an -e! If I had looked up en/ro/Revolution in the first place, it would have been a unique result.

Come to find out, after Wikipediacizing a bit, all this has to do with the fact that Revoluţia shows up in definite noun phrases. There was only one October Revolution (thankfully), so we have Revoluţia din Octombrie, but when the word for “revolution” stands alone (as it does in the name of the Romanian article on “Revolution”), we get Revoluţie. I’m not sure how those details work out, but the distinction is plain enough.

I can attest that this frequency trick often works for finding translated pairs of words; I’ve done it a lot. But in this case, at least, grammatical variation within the target side of the results makes the numbers a bit misleading.
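
For the curious, the frequency trick for a single word might look roughly like this: keep the pairs whose English side contains the word, then count the words on the Romanian side. This is a sketch under assumptions, not the lexicon tool's code. `PAIRS` is just the seven pairs quoted in this post and `candidate_translations` is a hypothetical helper name.

```python
from collections import Counter

# (Romanian, English) pairs quoted in the post; a toy stand-in for the lexicon.
PAIRS = [
    ("Revoluţia franceză", "French Revolution"),
    ("Revoluţia din Februarie", "February Revolution"),
    ("Revoluţia engleză", "English Civil War"),
    ("Revoluţie", "Revolution"),
    ("Revoluţia industrială", "Industrial Revolution"),
    ("Revoluţia din Octombrie", "October Revolution"),
    ("Revoluţia din Neolitic", "Neolithic Revolution"),
]


def candidate_translations(english_word, pairs):
    """Rank Romanian words by how often they co-occur with english_word."""
    counts = Counter()
    for romanian, english in pairs:
        # Only look at pairs whose English side contains the exact word.
        if english_word in english.split():
            counts.update(romanian.split())
    return counts.most_common()


ranked = candidate_translations("Revolution", PAIRS)
print(ranked[:3])
```

On this sample the definite form Revoluţia comes out on top with five co-occurrences, and the dictionary form Revoluţie trails with one, which is exactly the kind of misleading count described above.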

12 Comments for '-ia -ie… doh! A Tale of a Wayward Wildcard'

  1. Comment received 1 year, 3 months ago from Isabelle

    But why bother with a theory, when your very first return list makes it clear right from the start: Revoluţie » Revolution? Oh, OK, you just like to have fun with words and queries ;-)

  2. Comment received 1 year, 3 months ago from Patrick Hall

    Hi Isabelle,

    Yes, I confess, playing weird statistical games with words is an addiction of mine. :)

    Actually, I should have put in another example. The whole point of this little saga was this: imagine (as I sloppily did) that the pair Revoluţie » Revolution was not in the lexicon already.

    If I had searched just the English side for “Revolution,” and counted up the most common word on the Romanian side, I would have found Revoluţia as a likely translation. In a sense it is a translation; it’s just not what we think of as the “dictionary” translation. So even there it would be useful as a start.

    And if I’m an English/Romanian translator who (for some bizarre reason!) doesn’t know how to say “revolution”, that would be information enough. After all, I’d know the grammar of the language, so I’d be able to figure out what was up with the endings.

    I’ll try to come up with some more practical examples and post them in another comment.

    Thanks for stopping by!

  3. Comment received 1 year, 3 months ago from Chris

    There was only one October Revolution (thankfully) […]

    Uh, what? That’s not what I learnt in school. I can think of a bunch of them, actually.

  4. Comment received 1 year, 2 months ago from Mihai

    From a Romanian:
    Revolution » Revoluţie
    The revolution » Revoluţia

  5. Comment received 1 year, 2 months ago from Patrick Hall

    Mulţumesc Mihai!

  6. Comment received 1 year, 2 months ago from Patrick Hall

    Chris, you can always edit the Romanian Wikipedia ☺

  7. Comment received 1 year, 1 month ago from Mihai

    Now, to be 100% correct, the Romanian language uses t and s with comma below (U+0218, U+0219, U+021A, U+021B), not the forms with cedilla (U+015E, U+015F, U+0162, U+0163).
    So it is “Revoluție/Revoluția/Mulțumesc” :-D

    Mulțumesc,
    Mihai
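
A minimal sketch of mapping the cedilla forms onto the comma-below forms, using only the eight code points Mihai lists. The helper name `to_comma_below` is made up for illustration. My assumption, not something stated in the thread, is that normalizing like this before counting words would keep the two spellings of the same word from being tallied separately.

```python
# Map each cedilla code point to its comma-below counterpart.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
    "\u015F": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u0163": "\u021B",  # t with cedilla -> t with comma below
})


def to_comma_below(text):
    """Replace S/T-with-cedilla by S/T-with-comma-below; leave everything else alone."""
    return text.translate(CEDILLA_TO_COMMA)


# "Revolu\u0163ie" is the cedilla spelling used in the post itself.
print(to_comma_below("Revolu\u0163ie"))
```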

  8. Comment received 1 year, 1 month ago from Patrick Hall

    Oh, wow, I totally did not know that.

    The font layout on this blog is rather too small to see such distinctions; Ctrl+’ing a bit, I see what you mean! However, the Cedilla forms seem to be the ones in use on Wikipedia. Is this a matter of contention on the Romanian Wikipedia?

    Some random Googling:
    blogul să vă - Google Blog Search

    for Romanian blogs turned up at least one with mixed usage. In this blog, we see the t-cedilla in the title of the post, t-comma in the body, but s-cedilla everywhere:

    Idioţi, agramaţi şi tupeişti din toată lumea, LA MULŢI ANI! « De ce urâm bărbaţii

    It’s fun digging around for such details, thanks for pointing them out.

    Mulțumesc again! ☺

  9. Comment received 1 year, 1 month ago from MunteAlb

    The correct letters are S and T with comma, not with cedilla. Unfortunately, they were not supported under Windows until Vista came out. So Romanians wrote on the web using S and T with cedilla because this was the only available alternative (though many chose not to use diacritics at all, because the web pages were not displayed correctly on systems with the language set to English).

    There is no contention at all: Wikipedia RO uses S/T-with-cedilla because the majority of contributors do not use Vista or have not installed the MS patch that adds support for S/T-with-comma in XP.

    The problems are not fully solved. Some computer programs support Unicode in their interface, so they also support S/T-with-comma, but the majority of software supports only Windows-1250, that is, S/T-with-cedilla. Also, I use Vista and this comment box does not allow diacritics at all, regardless of the Romanian keyboard driver that I use.

  10. Comment received 1 year, 1 month ago from Patrick Hall

    Hi,

    Thanks for the detailed explanation!

  11. Comment received 6 months, 2 weeks ago from Michail Stoiculescu

    Could someone correct this:
    Not Romany Vlax but Aromanian
    The page [http://blogamundo.net/lab/wordlengths/]
    at no. 199 is not Romany Vlax language but Aromanian, a Romanian dialect.
    Regards,
    Michail Stoiculescu

  12. Comment received 5 days, 16 hours ago from tom

    […] 27, 2008 by Gershom Gorenberg To continue a conversation with Haim about politics and physics: Faux pas, shmaux pas. In physics, action and reaction refer to motion. In Israeli-Palestinian relations, […]
