Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Some random linkage…

Written by Patrick Hall, 2 years ago.

Here are a couple of thoughts.

Detecting Spam Web Pages through Content Analysis
Via News you can Bruise, a nice paper on spam detection. But this little buried tidbit was of most interest to yours truly: “In our data set, the majority of the pages (about 54%) are written in the English language, as determined by the parser used by MSN Search.” Just over half, that’s all! The authors included a member of Microsoft Research, so presumably they had pretty unlimited access to the crawler. Their language identification algorithm is proprietary, according to the footnote, so who knows how good it is. But if that number is accurate, it’s even more evidence that English isn’t the 500-pound gorilla of the net anymore…
Indiantelevision.com > Media, Advertising & Marketing Watch > NRS 2005: ‘Jagran’ topples ‘Bhaskar’ to claim top slot
Remember print? I suspect that in India newspapers are still a relatively more important source of news than the web or TV. (Anybody know whether that’s true?) Anyway, this article describes a telling recent change: the newspaper with the largest circulation on the subcontinent is now in Hindi, not English: India Today. (Just checked their site… it’s not Unicode, ew, ew!!!)
