Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now


How do you Google Something you can’t Spell?

Written by Patrick Hall, June 3rd, 2006

The most recent installment of a series of fun posts on language identification came and went too soon for me to notice and send in my guess (I woulda been right, too, dagnabbit!), but the ensuing discussion has gotten me thinking about something. I’ll just bring the topic up tonight. Er, this morning.

(Psst: dig their new book.)

Anyway, the cat’s out of the bag now: the language in question was Romanian.

It just so happens that a few days ago I was reading an interesting short paper called “Using N-grams to Process Hindi Queries with Transliteration Variations” (pdf). As you may be aware, there’s a lot of Hindi on the web that’s written in transliterated form, especially Bollywood lyrics. (Why such texts are transliterated instead of written in Devanāgarī merits another post…)

I’ve been hacking on a bit of code to simulate the approach in the paper, but it’s not really ready for prime time yet. Anyway, here’s what reminded me of the paper:

Language Log transcription:

Gaşca-i adunată din mii
Nu s-a schimbat
În pliu fu şi în fiţe o ţine ne-ncetat
Tatuaje noi, inele şi cercei
Stând doar pe MTV
Şi nu ne pasă ce zic ei

E o lume nouă
Una nouă nouă nouă
E o lume nouă

Transcription found online:

Gashca-i adunata,
nimic nu s-a schimbat
In chefuri si in fite o tinem ne-ncetat
Cu tatuaje noi, inele si cercei
Stam doar pe MTV
si nu ne pasa ce zic ei

E o lume 9, 1@999
E o lume 9, yeah-yeah-yeah-yeah
E o lume 9, 1@999
E o lume 9, hei-hei-hei-hei.

The “authentic” version is the one with the “wrong” orthography. That’s a pretty common situation — Brazilians, for instance, will write «naum» for não, «eh» for é, and so on. These web orthographies may not make language teachers smile, but they’re by no means marginal in statistical terms.

And from the point of view of a web search, the difference is more than academic. After all, a search for “Gaşca-i adunată din mii” fails, but a search for “Gashca-i adunata, nimic nu s-a schimbat” succeeds.

And this problem is what the paper describes: a simple method of using n-grams (substrings of words) to perform fuzzy matches. The queries are for song titles, as a matter of fact: a query such as jane na nazar jigar pehchanay, typed in one of a bazillion idiosyncratic transliterations, gets run against a database where the song might actually be recorded as jaane na nazar pehchaane jigar yeh kaun. (I presume that pehchanay and pehchaane, jane and jaane, are in fact different ways to transcribe the same word.)
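To make the idea concrete, here is a rough Python sketch of n-gram fuzzy matching. It is not the paper's actual method, just one simple way to do it under assumptions of my own: character bigrams with padded word edges, the Dice coefficient as the similarity measure, and a query score that averages each query word's best match in the title. The function names (fold, ngrams, dice, query_score, best_match) and the tiny "database" are made up for illustration; its entries are just the song title and two lyric lines quoted above.

```python
import unicodedata


def fold(text):
    """Lowercase and strip combining marks, so that e.g. 'ş' becomes 's'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded at both edges."""
    padded = "_" + word + "_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient over the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def query_score(query, title, n=2):
    """Average, over the query words, of the best-matching title word."""
    query_words = fold(query).split()
    title_words = fold(title).split()
    if not query_words or not title_words:
        return 0.0
    best = [max(dice(q, t, n) for t in title_words) for q in query_words]
    return sum(best) / len(best)


def best_match(query, titles, n=2):
    """Pick the title with the highest fuzzy score for the query."""
    return max(titles, key=lambda title: query_score(query, title, n))


# Illustrative stand-in for a song database (an assumption, not real data).
titles = [
    "jaane na nazar pehchaane jigar yeh kaun",
    "Stam doar pe MTV",
    "si nu ne pasa ce zic ei",
]

print(best_match("jane na nazar jigar pehchanay", titles))
print(round(dice("pehchanay", "pehchaane"), 2))
print(round(dice(fold("Gaşca-i"), fold("Gashca-i")), 2))
```

Because the score works word by word, word order in the query doesn't matter, which suits the jigar/pehchanay swap in the example. Note that stripping diacritics alone wouldn't rescue the Romanian case (gasca versus gashca still differ), but the two spellings share most of their bigrams, so they still score as close.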

The search engines of today don’t really allow you to get around such problems — and while Google’s spelling suggestion tool makes an effort, it doesn’t really help much here.

Anyway, like I said, I’m just bringing up the topic. Fuzzy matching on stuff like this is a fun thing to code. It’s surprising how successful a simple approach can be. I’ll try to post that code if I get around to making it readable.
