hacklog

(Language) Lost in Aggregation

Written by Patrick Hall, July 20th, 2007

Lots of “aggregation”-style sites, where content streams in either from various users (del.icio.us, Digg / News, Reddit) or automatically from various sites via feeds or via search (Bloglines, Technorati, Afrigator), face an interesting problem: how should such a service deal with multilingual input?

Let me show you what I mean with just a few examples:

del.icio.us

Japanese links on del.icio.us

Technorati

Spanish link on Technorati

Afrigator

French link on Afrigator

Disclaimer: Creating an aggregator which expressly allows and even encourages multilingual content is a perfectly noble thing to do. There are tons of reasons why one would want such a thing. (Perhaps the aggregator is from Switzerland or South Africa!)

In each of the screenshots, you have a language which is not “the” language of the site (such links are shaded yellow): Japanese on del.icio.us, Spanish on Technorati, French on Afrigator. But this is the nature of a collaborative site; clearly, in the case of del.icio.us, there are a ton of Japanese users*.

What is the right way to handle language identification in an aggregator?

  1. Let it be. Aggregators are supposed to be “emergent” anyhow.
  2. Use the Spec. Rely on things like lang attributes in XHTML and all that to classify posts.
  3. Get statistical. Classify posts automatically by spidering the links, then running a statistical language identifier on the page.

My own opinions on these hypothetical attitudes:

  1. Okay John Lennon, but… The fact is, nobody reads every language. And if current trends continue, the linguistic diversity of the web is only going to grow and grow. Does it make sense to rule out even the option of filtering content in aggregators by language, just to be… um… emergent?
  2. Specs are all well and good but… Sinful though it may be, people don’t use the lang attribute consistently (yet), in XHTML forms or anywhere else.
  3. Just count letters! This is the option I favor, in theory.
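To make the second attitude concrete, here is a minimal sketch of what “using the spec” could look like, assuming the aggregator already has the page markup in hand. The names `LangSniffer` and `declared_language` are made up for illustration; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser
from typing import Optional


class LangSniffer(HTMLParser):
    """Remember the first lang / xml:lang value declared on the <html> element."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this hook.
        if tag == "html" and self.lang is None:
            declared = dict(attrs)
            value = declared.get("lang") or declared.get("xml:lang")
            if value:
                self.lang = value.strip().lower()


def declared_language(markup: str) -> Optional[str]:
    """Return the language the page claims for itself, or None if it claims nothing."""
    sniffer = LangSniffer()
    sniffer.feed(markup)
    return sniffer.lang
```

A page that opens with `<html xml:lang="ja" lang="ja">` yields `"ja"`, while a page with a bare `<html>` yields `None`. That second case is the weak spot described above: when authors skip the attribute, this approach has nothing to say.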

Let’s talk about that third option. Implementing a statistical language identifier isn’t too hard.

Here’s a buggy but fairly functional one I wrote by leveraging some existing libraries:

what language is this?
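For a rough sense of what “just count letters” can mean, here is a minimal sketch (not the implementation linked above) that compares character trigram counts using cosine similarity. The three sample sentences, the `SAMPLES` table, and the function names are all assumptions made for illustration; a usable identifier would build its profiles from far more text per language.

```python
import math
from collections import Counter

# Toy training text, assumed for illustration only. Real profiles need much more data.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is what "
          "the people of the world read every day",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esto es "
          "lo que la gente del mundo lee cada día",
    "fr": "le rapide renard brun saute par dessus le chien paresseux et "
          "c'est ce que les gens du monde lisent chaque jour",
}


def trigrams(text: str) -> Counter:
    """Count overlapping three-character sequences, padding both ends with a space."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def guess_language(text: str) -> str:
    """Return the code of the sample language whose trigram profile is closest."""
    profile = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

With these toy profiles, `guess_language("les gens du monde")` picks `"fr"` and `guess_language("la gente del mundo")` picks `"es"`. Note that the sketch always answers with one of the languages it knows, so Japanese input would still get labeled as one of the three; a real classifier would need a confidence threshold or an “unknown” bucket.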

The problem is, (most) aggregators aren’t search engines. They don’t want to go spidering every one of the bazillions of pages that people post. Seriously, think of what Digg.com would have to do to spider every page posted. It would have to be Google.

So, there you go. I don’t know what the right answer is.

What are your reactions to these “attitudes”? Are there any I have missed?

*Interestingly, I took that screenshot from the tag page http://del.icio.us/popular/tool, and the Japanese word “ツール” (tsuuru, “tool”) has its own tag: http://del.icio.us/tag/ツール. Somebody oughta write a paper about these sorts of cross-linguistic tag relationships…
