Using Search Engines to Find Blogs by Language
One key part of Blogamundo is going to be an aggregator. And of course, it will be multilingual. And when you put “multilingual” and “web” into the same sentence, sadly enough, you’re going to have to deal with encodings, too. Or rather, you’re going to end up having to convert everything which isn’t, er, pure, to The One True Encoding.
Which will involve testing, which involves finding a bunch of blogs in some particular language in whatever encodings that language happens to be written in.
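To make the "convert everything" step concrete, here is a minimal Python sketch. It assumes The One True Encoding means UTF-8, and that Turkish pages which aren't UTF-8 will mostly be in windows-1254 or ISO-8859-9; both the function name `to_utf8` and the candidate list are illustrative guesses, not Blogamundo code.

```python
# Candidate encodings to try, in order. UTF-8 goes first because it is strict:
# text in a legacy 8-bit encoding is rarely valid UTF-8 by accident.
# ISO-8859-9 goes last because it accepts every byte and so never fails.
CANDIDATES = ["utf-8", "cp1254", "iso8859_9"]


def to_utf8(raw, candidates=CANDIDATES):
    """Return (utf8_bytes, encoding_used) for the first candidate that decodes raw."""
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # not this one, try the next candidate
        return text.encode("utf-8"), enc
    raise ValueError("none of the candidate encodings could decode the input")


if __name__ == "__main__":
    legacy = "İstanbul'da yağmur".encode("cp1254")
    converted, used = to_utf8(legacy)
    print(used, converted.decode("utf-8"))
```

Trying encodings in order is only a fallback. When a page or feed declares its encoding, that declaration is the better first guess, and "decoded without an error" is not the same thing as "decoded correctly".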
Now, all of these noble goals aside, the fact is I think looking for blogs in languages I don’t know is… well, kind of fun really.
So I thought I’d write down a few approaches we’ve used, starting with the blatantly obvious.
Comments are of course welcome…
Let’s take, oh, I dunno, Turkish as an example. Obviously the first stop is search engines.
Blog search engines
This idea is painfully simple: just pick a word that is likely to show up in the language in question. City names work well.
→ Technorati
They’ve got language identification stuff built in, but let’s just start with a likely keyword:
Searching for “ankara”—sure enough, we get some hits there: one, two, three
(These three are all on MSN, interestingly enough.)
Restricting to Turkish works as expected:
Technorati search in Turkish ‘ankara’
→ Google Blog Search
Here’s the equivalent search on Google’s blog search thingie, also restricted to Turkish:
Google Blogsearch in Turkish, ‘ankara’
More Turkish results: one, two, three
→ Icerocket
Keyword search works here as well, but Icerocket.com apparently has no language restriction option.
Well that was simple.
Turkish was easy; there seems to be plenty of Turkish blog content out there, and two out of three search engines have an option for restricting searches to Turkish. (Wikipedia tells us that there are something like 60 – 75 million Turkish speakers—it’s a pretty big language.)
So this game wasn’t too challenging. Between those three search engines we would probably be able to come up with maybe 100 URLs of Turkish blogs, just by digging around in blogrolls and/or writing a script to spider them (or using an existing script like Sean B. Palmer’s).
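For the "writing a script" option, the core of a blogroll spider is just pulling the outbound links off a page. The sketch below uses only the Python standard library; it is not Sean B. Palmer's script, the names `LinkCollector` and `external_hosts` are made up for illustration, and fetching the pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def external_hosts(html, base_url):
    """Return the sorted hostnames linked from html, excluding the page's own host."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    own = urlparse(base_url).hostname
    hosts = {urlparse(link).hostname for link in parser.links}
    # Links without a hostname (mailto: and the like) are dropped.
    return sorted(h for h in hosts if h and h != own)
```

Run over a few dozen seed blogs, the hosts that keep turning up are good candidates for the list. This sketch makes no attempt to tell a blogroll from any other group of links, so expect plenty of noise.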
Next time we’ll look at building such a list, and then try to get a bit of info about Turkish word frequency, which we can then feed back into our searches. After all, not everybody’s blogging about Ankara.
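The word-frequency part could look roughly like this in Python. It assumes the blog text has already been fetched, stripped of markup, and decoded to Unicode; `word_frequencies` and `turkish_lower` are illustrative names only. The one Turkish-specific wrinkle is lowercasing, since Turkish has both a dotted and a dotless i.

```python
import re
from collections import Counter

# Runs of Unicode letters: word characters minus digits and underscore.
WORD = re.compile(r"[^\W\d_]+")


def turkish_lower(s):
    # Plain str.lower() turns "I" into "i" and "İ" into "i" plus a combining dot.
    # Turkish wants "I" -> "ı" and "İ" -> "i", so handle those two first.
    return s.replace("İ", "i").replace("I", "ı").lower()


def word_frequencies(texts):
    """Count how often each lowercased word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(WORD.findall(turkish_lower(text)))
    return counts
```

Calling `word_frequencies(posts).most_common(20)` on a decent pile of posts should give a list of everyday words that make better search keywords than city names.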
But is it really?
But we’ll also take a closer look at the results we’ve gotten: are they really all in Turkish? I suspect that we’ll find some edge cases sooner or later.
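One crude way to check is to measure what share of a page's words come from a short list of very common Turkish words. The Python sketch below uses a hand-picked list and an arbitrary threshold, neither of which is measured from real data, so treat it as a first filter rather than real language identification.

```python
import re

# A few very common Turkish function words, hand-picked for illustration.
COMMON_TURKISH = {
    "ve", "bir", "bu", "için", "de", "da",
    "ile", "çok", "ama", "gibi", "daha", "var",
}
WORD = re.compile(r"[^\W\d_]+")  # runs of Unicode letters


def turkish_score(text):
    """Return the fraction of words in text that are in COMMON_TURKISH."""
    words = WORD.findall(text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in COMMON_TURKISH)
    return hits / len(words)


def looks_turkish(text, threshold=0.1):
    # The threshold is an arbitrary starting point, not a tuned value.
    return turkish_score(text) >= threshold
```

Short words like "de" and "da" also occur in other languages, and a post that mixes Turkish and English will land somewhere in the middle, which is exactly the kind of edge case to expect.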
Technorati tags: Blogamundo, language, search