Here’s a little tale about the dangers of counting words.
Ooh, danger.
We’ve been working on a lexicon tool for Blogamundo, which has wildcard support. So you can do things like search for:
ro/en/Revolu*
…which means “find me all the English translations for Romanian words that start with Revolu.” It’s fun to play with, and useful when translating.
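To make the query format concrete, here is a minimal sketch of how a lookup like that could work. It is not the actual Blogamundo code: the `LEXICON` dictionary and the `lookup` function are made up for illustration, and the shell-style matching from Python's standard `fnmatch` module is only assumed to behave like the tool's wildcard.

```python
from fnmatch import fnmatchcase

# Hypothetical toy lexicon: (source language, target language) -> {term: translation}.
LEXICON = {
    ("ro", "en"): {
        "Revoluţie": "Revolution",
        "Revoluţia franceză": "French Revolution",
        "Revoluţia industrială": "Industrial Revolution",
        "România": "Romania",
    },
}

def lookup(query):
    """Parse a 'src/tgt/pattern' query and return matching (term, translation) pairs."""
    src, tgt, pattern = query.split("/", 2)
    entries = LEXICON.get((src, tgt), {})
    # fnmatchcase does case-sensitive shell-style matching, so '*' matches any suffix.
    return [(term, trans) for term, trans in entries.items()
            if fnmatchcase(term, pattern)]

matches = lookup("ro/en/Revolu*")
for term, trans in matches:
    print(f"{term} » {trans}")
```

The point to notice is that the wildcard matches every entry beginning with Revolu, whether it is a single word or a whole phrase.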
But here’s the interesting thing. If you actually run that query on the lexicon we’ve bootstrapped from Wikipedia, you get results like this:
Revoluţia franceză » French Revolution
Revoluţia din Februarie » February Revolution
Revoluţia engleză » English Civil War
Revoluţie » Revolution
Revoluţia industrială » Industrial Revolution
Revoluţia din Octombrie » October Revolution
Revoluţia din Neolitic » Neolithic Revolution
And then a bazillion more revolutions.
So as I sat staring at that pattern of words, as I am wont to do, I thought “Hmm, maybe if we counted up the word frequencies of every word in that whole result list, the actual pair Revolution » {whatever the Romanian word for revolution is} would bubble up to the top.” It seemed at a glance that the translation was obvious: Revoluţia. But if we could use frequency alone to detect that pair automatically, perhaps it would be possible to run the same trick with other search result sets, and thusly improve the lexicon.
So I tested my little theory by simply splitting the result set into words and counting them all up:
21 Revoluţia
15 Revolution
7 of
7 din
4 revolution
4 de
4 1848
3 la
3 Revolutionary
2 Socialist
As you can see, Revolution and Revoluţia are by far the most common. This strongly suggested, I imagined, that Revolution and Revoluţia were in fact translations of each other.
Except they aren’t.
Not exactly, anyway: as clever people have probably already noticed, I had actually missed the answer, sitting in the results of the initial query, precisely because I had used a wildcard search. There it was, plain as day:
Revolution » Revoluţie
…with an -e! If I had looked up en/ro/Revolution in the first place, it would have been a unique result.
Come to find out, after Wikipediacizing a bit, all this is to do with the fact that Revoluţia is showing up in definite noun phrases. There was only one October Revolution (thankfully), so we have Revoluţia din Octombrie, but when the word for “revolution” stands alone (as it does in the name of the Romanian article on “Revolution”), we get Revoluţie. I’m not sure how those details work out, but the distinction is plain enough.
I can attest to the fact that this frequency trick often works for finding translated pairs of words; I’ve done it a lot. But in this case, at least, grammatical variation within the target side of the results leads to the numbers being a bit misleading.
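If you want to try the counting trick yourself, here is a rough sketch in Python. It is not the script I actually ran: the `results` list contains only the seven pairs quoted earlier (the real result set was much longer), and `word_counts` is a name invented for this example. It splits on whitespace and nothing fancier.

```python
from collections import Counter

# Only the pairs quoted earlier in this post, as (Romanian, English).
results = [
    ("Revoluţia franceză", "French Revolution"),
    ("Revoluţia din Februarie", "February Revolution"),
    ("Revoluţia engleză", "English Civil War"),
    ("Revoluţie", "Revolution"),
    ("Revoluţia industrială", "Industrial Revolution"),
    ("Revoluţia din Octombrie", "October Revolution"),
    ("Revoluţia din Neolitic", "Neolithic Revolution"),
]

def word_counts(pairs):
    """Count every whitespace-separated word on both sides of each pair."""
    counts = Counter()
    for source, target in pairs:
        counts.update(source.split())
        counts.update(target.split())
    return counts

counts = word_counts(results)
for word, n in counts.most_common(5):
    print(n, word)
```

Even on this tiny sample, Revoluţia turns up six times and Revoluţie only once, so the count alone points at the definite form rather than the dictionary form.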