How Do You Google Something You Can’t Spell?
The most recent installment of a series of fun posts on language identification came and went too soon for me to notice and send in my guess (I woulda been right, too, dagnabbit!), but the ensuing discussion has gotten me thinking about something. I’ll just bring the topic up tonight. Er, this morning.
(Psst: dig their new book.)
Anyway, the cat’s out of the bag now: the language in question was Romanian.
It just so happens that a few days ago I was reading an interesting short paper called “Using N-grams to Process Hindi Queries with Transliteration Variations” (pdf). As you may be aware, there’s a lot of Hindi on the web that’s written in transliterated form, especially Bollywood lyrics. (Why such texts are transliterated instead of written in Devanāgarī merits another post…)
I’ve been hacking on a bit of code to simulate the approach in the paper, but it’s not really ready for prime time yet. Anyway, here’s what reminded me of the paper:
| Language Log Transcription | Transcription found online |
|---|---|
| Gaşca-i adunată din mii Nu s-a schimbat | Gashca-i adunata, nimic nu s-a schimbat |
| În pliu fu şi în fiţe o ţine ne-ncetat | In chefuri si in fite o tinem ne-ncetat |
| Tatuaje noi, inele şi cercei | Cu tatuaje noi, inele si cercei |
| Stând doar pe MTV Şi nu ne pasă ce zic ei | Stam doar pe MTV si nu ne pasa ce zic ei |
| E o lume nouă | E o lume 9, 1@999 |
The “authentic” version is the one with the “wrong” orthography. That’s a pretty common situation — Brazilians, for instance, will write «naum» for não, «eh» for é, and so on. These web orthographies may not make language teachers smile, but they’re by no means marginal in statistical terms.
And from the point of view of a web search, the difference is more than academic. After all, the search Gaşca-i adunată din mii fails, but Gashca-i adunata, nimic nu s-a schimbat succeeds.
And this is the problem the paper addresses: it describes a simple method of using n-grams (substrings of words) to perform fuzzy matches. On queries for song titles, as a matter of fact. Someone types jane na nazar jigar pehchanay, in one of a bazillion idiosyncratic transliterations, against a database where the song might actually be recorded as jaane na nazar pehchaane jigar yeh kaun. (I presume that pehchanay and pehchaane, jane and jaane, are in fact different ways to transcribe the same word.)
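To make the idea concrete, here’s a minimal sketch of n-gram fuzzy matching in Python. To be clear, this is not the paper’s algorithm and not the code I’ve been hacking on. The bigram size, the `_` padding, the Dice coefficient as the overlap measure, and all the function names are my own assumptions, picked only to illustrate the general approach.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams of a word, padded with '_' at both ends.

    The padding and the default n=2 are illustrative assumptions.
    """
    padded = "_" + word.lower() + "_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def word_similarity(a, b, n=2):
    """Dice coefficient over the two words' n-gram sets (assumed measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def title_score(query, title, n=2):
    """Average, over the query's words, of the best similarity to any title word.

    Word order is ignored, so "jigar pehchanay" can match "pehchaane jigar".
    """
    query_words, title_words = query.split(), title.split()
    if not query_words or not title_words:
        return 0.0
    best = [max(word_similarity(q, t, n) for t in title_words)
            for q in query_words]
    return sum(best) / len(best)


def best_match(query, titles, n=2):
    """Return the title that scores highest against the query."""
    return max(titles, key=lambda t: title_score(query, t, n))


if __name__ == "__main__":
    # Toy "database": the two transcriptions quoted in this post.
    titles = [
        "jaane na nazar pehchaane jigar yeh kaun",
        "Gashca-i adunata, nimic nu s-a schimbat",
    ]
    print(best_match("jane na nazar jigar pehchanay", titles))
```

With these choices, jane and jaane share five of their bigrams and score about 0.91, and pehchanay and pehchaane score 0.7, so the misspelled query still lands on the right title even though only some of its words match exactly. A real system would need an index rather than a linear scan over every title, and probably a cutoff below which nothing counts as a match.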
The search engines of today don’t really allow you to get around such problems — and while Google’s spelling suggestion tool makes an effort, it doesn’t really help much here.
Anyway, like I said, I’m just bringing up the topic. Fuzzy matching on stuff like this is a fun thing to code. It’s surprising how successful a simple approach can be. I’ll try to post that code if I get around to making it readable.