Some random linkage…
Here are a couple of thoughts.
- Detecting Spam Web Pages through Content Analysis
- Via News you can Bruise, a nice paper on spam detection. But this little buried tidbit was of most interest to yours truly: “In our data set, the majority of the pages (about 54%) are written in the English language, as determined by the parser used by MSN Search.” Just over half, that’s all! The authors included a member of Microsoft Research, so presumably they had pretty unlimited access to the crawler. Their language identification algorithm is proprietary, according to the footnote, so who knows how good it is. But if that number is accurate, it’s even more evidence that English isn’t the 500-pound gorilla of the net anymore…
- Indiantelevision.com > Media, Advertising & Marketing Watch > NRS 2005: ‘Jagran’ topples ‘Bhaskar’ to claim top slot
- Remember print? I would suspect that in India newspapers are still a relatively more important source of news than the web or TV. (Anybody know whether that’s true?) Anyway, this article describes a telling recent change: the newspaper with the largest circulation on the subcontinent is now in Hindi, not English: India Today. (Just checked their site… it’s not Unicode, ew, ew!!!)
Technorati tags: english, hindi, Language and the Web, हिन्दी