-ia -ie… doh! A Tale of a Wayward Wildcard
Here’s a little tale about the dangers of counting words.
Ooh, danger.
We’ve been working on a lexicon tool for Blogamundo, which has wildcard support. So you can do things like search for:
ro/en/Revolu*
…which means “find me all the English translations for Romanian words that start with Revolu.” It’s fun to play with, and useful when translating.
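To make the query format concrete, here is a rough sketch of how such a lookup might work. It is not the actual Blogamundo code: the `LEXICON` dictionary and the `lookup` function are made up for illustration, and it assumes a query is always source language, target language, and pattern, separated by slashes.

```python
from fnmatch import fnmatchcase

# Hypothetical mini-lexicon: (source lang, target lang) -> list of (source, target) pairs.
LEXICON = {
    ("ro", "en"): [
        ("Revoluţia franceză", "French Revolution"),
        ("Revoluţie", "Revolution"),
        ("Republică", "Republic"),
    ],
}

def lookup(query):
    """Return the (source, target) pairs matching a query like 'ro/en/Revolu*'."""
    src, tgt, pattern = query.split("/", 2)
    pairs = LEXICON.get((src, tgt), [])
    # fnmatchcase does case-sensitive shell-style matching,
    # so '*' stands for any run of characters.
    return [(s, t) for s, t in pairs if fnmatchcase(s, pattern)]

for source, target in lookup("ro/en/Revolu*"):
    print(source, "»", target)
```

Without the trailing `*`, the same function only returns exact matches on the source side.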
But here’s the interesting thing. If you actually run that query on the lexicon we’ve bootstrapped from Wikipedia, you get results like this:
Revoluţia franceză » French Revolution
Revoluţia din Februarie » February Revolution
Revoluţia engleză » English Civil War
Revoluţie » Revolution
Revoluţia industrială » Industrial Revolution
Revoluţia din Octombrie » October Revolution
Revoluţia din Neolitic » Neolithic Revolution
And then a bazillion more revolutions.
So as I sat staring at that pattern of words, as I am wont to do, I thought “Hmm, maybe if we counted up the word frequencies of every word in that whole result list, the actual pair Revolution » {whatever the Romanian word for revolution is} would bubble up to the top.” It seemed at a glance that the translation was obvious: Revoluţia. But if we could use frequency alone to detect that pair automatically, perhaps it would be possible to run the same trick with other search result sets, and thusly improve the lexicon.
So I tested my little theory by simply splitting the result set into words and counting them all up:
21 Revoluţia
15 Revolution
7 of
7 din
4 revolution
4 de
4 1848
3 la
3 Revolutionary
2 Socialist
As you can see, Revolution and Revoluţia are by far the most common. This strongly suggested, I imagined, that Revolution and Revoluţia were in fact translations.
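Something along these lines would do the counting. This is only a sketch: it assumes plain whitespace splitting, the `count_words` helper is an illustrative name, and it uses just the seven pairs quoted above rather than the full result set, so its numbers are smaller than the counts in the list.

```python
from collections import Counter

# The seven result pairs quoted earlier: (Romanian, English).
RESULTS = [
    ("Revoluţia franceză", "French Revolution"),
    ("Revoluţia din Februarie", "February Revolution"),
    ("Revoluţia engleză", "English Civil War"),
    ("Revoluţie", "Revolution"),
    ("Revoluţia industrială", "Industrial Revolution"),
    ("Revoluţia din Octombrie", "October Revolution"),
    ("Revoluţia din Neolitic", "Neolithic Revolution"),
]

def count_words(pairs):
    """Count every whitespace-separated word on both sides of every pair."""
    counts = Counter()
    for source, target in pairs:
        counts.update(source.split())
        counts.update(target.split())
    return counts

counts = count_words(RESULTS)
for word, n in counts.most_common(3):
    print(n, word)
```

Note that the counting is case-sensitive, which is why Revolution and revolution show up as separate entries in the list above.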
Except they aren’t.
Not exactly, anyway: as clever people have probably already noticed, I had actually missed the answer, sitting in the results of the initial query, precisely because I had used a wildcard search. There it was, plain as day:
Revolution » Revoluţie
…with an -e! If I had looked up en/ro/Revolution in the first place, it would have been a unique result.
Come to find out, after Wikipediacizing a bit, all this is to do with the fact that Revoluţia is showing up in definite noun phrases. There was only one October Revolution (thankfully), so we have Revoluţia din Octombrie, but when the word for “revolution” stands alone (as it does in the name of the Romanian article on “Revolution”), we get Revoluţie. Not sure how those details work out, but the distinction is plain enough.
I can attest to the fact that this frequency trick often works for finding translated pairs of words; I’ve done it a lot. But in this case, at least, grammatical variation within the target side of the results makes the numbers a bit misleading.
12 comments.
But why bother with a theory, when your very first return list makes it clear right from the start: Revoluţie » Revolution? Oh, OK, you just like to have fun with words and queries ;-)
Hi Isabelle,
Yes, I confess, playing weird statistical games with words is an addiction of mine. :)
Actually, I should have put in another example. The whole point of this little saga was this: imagine (as I sloppily did) that the pair Revoluţie » Revolution was not in the lexicon already.
If I had searched just the English side for “Revolution,” and counted up the most common word on the Romanian side, I would have found Revoluţia as a likely translation. In a sense it is a translation; it’s just not what we think of as the “dictionary” translation. So even there it would be useful as a start.
And if I’m an English/Romanian translator who (for some bizarre reason!) doesn’t know how to say “revolution”, that would be information enough. After all, I’d know the grammar of the language, so I’d be able to figure out what was up with the endings.
I’ll try to come up with some more practical examples and post them in another comment.
Thanks for stopping by!
Uh, what? That’s not what I learnt in school. I can think of a bunch of them, actually.
From a Romanian:
Revolution » Revoluţie
The revolution » Revoluţia
Mulţumesc Mihai!
Chris, you can always edit the Romanian Wikipedia ☺
Now, to be 100% correct, the Romanian language uses t and s with comma below (U+0218, U+0219, U+021A, U+021B), not the forms with cedilla (U+015E, U+015F, U+0162, U+0163)
So it is “Revoluție/Revoluția/Mulțumesc” :-D
Mulțumesc,
Mihai
Oh, wow, I totally did not know that.
The font layout on this blog is rather too small to see such distinctions; Ctrl+’ing a bit, I see what you mean! However, the cedilla forms seem to be the ones in use on Wikipedia. Is this a matter of contention on the Romanian Wikipedia?
Some random Googling:
blogul să vă - Google Blog Search
for Romanian blogs turned up at least one with mixed usage. In this blog, we see the t-cedilla in the title of the post, t-comma in the body, but s-cedilla everywhere:
Idioţi, agramaţi şi tupeişti din toată lumea, LA MULŢI ANI! « De ce urâm bărbaţii
It’s fun digging around for such details, thanks for pointing them out.
Mulțumesc again! ☺
The correct letters are S and T with comma, not with cedilla. But unfortunately they were not supported under Windows until Vista came out. So Romanians wrote on the web using S and T with cedilla because this was the only available alternative (though many chose not to write with diacritics at all, because the webpages were not displayed correctly on systems with the language set to English).
There is no contention at all. Wikipedia RO uses S/T-with-cedilla because the majority of contributors do not use Vista or have not installed the MS patch that adds support for S/T-with-comma in XP.
The problems are not fully solved. Some computer programs support Unicode in their interface, so they also support S/T-with-comma, but the majority of software supports only Windows-1250, that is, S/T-with-cedilla. Also, I use Vista and this comment box does not allow diacritics at all, regardless of the Romanian keyboard driver that I use.
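For a lexicon built from text that mixes both forms, one option is to fold the cedilla code points into the comma-below ones before comparing words. A minimal sketch, using the code points listed in the earlier comment; the `to_comma_below` helper is a made-up name, and the mapping should only be applied to text known to be Romanian, since S with cedilla is a legitimate letter in Turkish.

```python
# Map each cedilla code point to its comma-below counterpart.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
    "\u015F": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u0163": "\u021B",  # t with cedilla -> t with comma below
})

def to_comma_below(text):
    """Replace Romanian cedilla letters with the comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)

# The cedilla spelling used in the post and the comma-below spelling
# compare equal once both are folded to the same form.
print(to_comma_below("Revolu\u0163ie") == "Revolu\u021Bie")
```

Standard Unicode normalization (NFC or NFKC) does not merge the two sets of letters, so an explicit mapping like this is needed.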
Hi,
Thanks for the detailed explanation!
Could someone please correct this:
Not Romany Vlax but Aromanian
The page [http://blogamundo.net/lab/wordlengths/]
at no. 199 is not Romany Vlax language but Aromanian, a Romanian dialect.
Regards,
Michail Stoiculescu