Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Loonicode+0004

Written by Patrick Hall, 1 year, 11 months ago.
Loonicode 4

Virtual Language Maps: I’d Buy that for a Dollar

Written by Patrick Hall, 1 year, 11 months ago.

There’s an interesting website called Worldmapper which bills itself as “The world as you’ve never seen it before.” They have a collection of maps called “cartograms” that are scaled in clever ways in order to express complex global data.

So for instance, you can see at a glance that there are a lot of elderly people in India or that Brazil is a popular destination for refugees.

The maps can be a little challenging to read when countries become highly distorted (just which Scandinavian country is that giant lavender blob?), but even so, it’s a very useful view on data that’s inherently complex anyway.

I wish there were maps like this that dealt with linguistic topics. Mapping language is a pretty challenging task, but I’m intrigued that the Worldmapper software has been released by its creator, Michael Gastner.

There’s really a whole field here waiting to be explored, and I bet that there is already work being done (links welcome!). Some data is starting to trickle in: the now (sadly) defunct NITLE Weblog Census was a consequential early effort, and recently there was an interesting conversation around some language data collected by Technorati.

Ethan Zuckerman has an interesting metric for the distribution of languages in Wikipedia: it measures “the number of wikipedia articles per million native speakers of the language (WA/MS) for languages with over 30 million speakers.”
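To make the arithmetic behind WA/MS concrete, here is a minimal sketch. The function names, the language labels, and every figure in `sample` are made up for illustration; they are not Ethan's code or real counts. Only the ratio itself and the 30 million speaker cutoff come from the quote above.

```python
MIN_SPEAKERS = 30_000_000  # cutoff from the quoted definition


def articles_per_million_speakers(articles: int, native_speakers: int) -> float:
    """Wikipedia articles per million native speakers (WA/MS)."""
    if native_speakers <= 0:
        raise ValueError("native_speakers must be positive")
    return articles / (native_speakers / 1_000_000)


# Placeholder figures, NOT real article or speaker counts.
sample = {
    "language A": {"articles": 500_000, "native_speakers": 100_000_000},
    "language B": {"articles": 90_000, "native_speakers": 45_000_000},
    "language C": {"articles": 20_000, "native_speakers": 300_000},  # under the cutoff
}


def rank_by_wa_ms(data, min_speakers=MIN_SPEAKERS):
    """Return (language, WA/MS) pairs for languages over the cutoff, highest first."""
    rows = [
        (name, articles_per_million_speakers(d["articles"], d["native_speakers"]))
        for name, d in data.items()
        if d["native_speakers"] > min_speakers
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, score in rank_by_wa_ms(sample):
    print(f"{name}: {score:.0f} WA/MS")
```

Note that "language C" drops out only because of the cutoff. Without it, a small language with a lively Wikipedia would shoot straight to the top of the ranking.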

It would be fascinating to see all this sort of data charted onto a geographical map — Iceland would eat Greenland for lunch, for starters. I have a couple of hunches of my own that I’d like to test out.

Being bridge-of-nose-deep in hacking on Blogamundo of late (news soon), I don’t really have time to take a stab at the required data munging and so forth right now, but, ya know, some day.

Language Choosing Widgets

Written by Patrick Hall, 1 year, 12 months ago.

Sorry for the dearth of updates of late, been hackin!

But it is a good excuse to use the word “dearth.”

Web design geekery ensues…

In a previous post, I rather snarkily opined on the futility of using flags to identify languages. I still think flags are a lousy way to distinguish languages, but my kvetching doesn’t really address a solution to the problem:

What kind of web interface do you use to select from a long list of languages?

The obvious answer, and the one I sort of implied in the aforementioned post, is just to give a list of the languages in their native scripts. But such a list can quickly become unwieldy if it’s long enough.

So I figured I’d look around and see what other sites do. One could do this sort of thing forever, so I’m just going to look at some big international news sites today: BBC World Service, Deutsche Welle, and Voice of America News. (Add scare quotes around “news” in accordance with your politics… as far as Blogamundo is concerned, I have none.)

(Please leave comments with links to other similar sites if you know of any, particularly those in non-European languages.)

Deutsche Welle, oddly enough, has the same language chooser twice, as highlighted here:

Deutsche Welle language choosers

DW seems (as far as I can tell) to restrict their language choice interface to a simple dropdown box, whereas BBC World Service (which hosts a similar 33 languages) has an entire “language portal” — exclusively for selecting languages: BBC World Service | Languages. I think they did a very nice job of combining a geographical and textual method of listing the available languages:

BBC World Service language portal

However, there are some rather confusing inconsistencies in the hierarchy to ponder here: compare the entries for Portuguese:

  • Americas/Portuguese (BRASIL)
  • Africa/Portuguese (PORTUGUÊS)

Yet there is no analogous subentry for European Portuguese. (Perhaps BBC World has no such service?) The rest of the list is similarly unsystematic, but the BBC tends to be pretty careful about interface details, so they may well have reasons for that particular categorization.

I find the Voice of America language interface the most useful, as it combines both the BBC and DW approaches. There is a simple graphical chooser on the main page that looks like this:

VOA language chooser

This small world map has four clickable regions, more or less corresponding to The Americas, Europe, Asia, and Africa/Middle East. Each of those areas has its own page, which can in turn be viewed by language:

VOA Middle East and Africa by Language

Or country:

VOA Middle East and Africa by Country

If all goes well and we end up with content translated between lots of languages, I suspect we’ll do something similar to the VOA approach — a “portal” page for several geographical regions, further sortable by language or country (the data in the Common Locale Data Repository should prove useful in localizing this).
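Here is a rough sketch of the data behind such a portal, assuming each edition is tagged with a region, a language, and a country. The `Edition` type, the `portal` function, and the sample entries are all hypothetical; nothing here reflects how VOA (or Blogamundo) actually stores this.

```python
from typing import NamedTuple


class Edition(NamedTuple):
    region: str    # which portal page the edition appears on
    language: str  # display name (a real site would localize this)
    country: str


# Hypothetical sample entries, for illustration only.
EDITIONS = [
    Edition("Americas", "Portuguese", "Brazil"),
    Edition("Americas", "Spanish", "Mexico"),
    Edition("Africa/Middle East", "Portuguese", "Mozambique"),
    Edition("Africa/Middle East", "Swahili", "Kenya"),
    Edition("Africa/Middle East", "Arabic", "Egypt"),
]


def portal(editions, region, sort_by="language"):
    """Return one region's editions, sorted by language or by country."""
    if sort_by not in ("language", "country"):
        raise ValueError("sort_by must be 'language' or 'country'")
    rows = [e for e in editions if e.region == region]
    return sorted(rows, key=lambda e: getattr(e, sort_by))


for edition in portal(EDITIONS, "Africa/Middle East", sort_by="country"):
    print(f"{edition.country}: {edition.language}")
```

Keeping the region on each edition, rather than on the language, means Portuguese can show up under both the Americas and Africa without any special casing.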

Wikipedia sticks with a simple list of all languages in their native script (UTF-8 encoded, naturally), sorted by the number of articles.

(Of course, Wikipedia being Wikipedia, there are a thousand alternative ways to view that list, including by country or language, see the List of Wikipedias.)
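The simple-list approach is easy enough to sketch. The native names below are real, but the article counts are invented placeholders, and `chooser_entries` is a hypothetical helper rather than anything Wikipedia itself uses.

```python
# language code -> (name in its own script, article count)
# The counts are invented placeholders, NOT real Wikipedia figures.
WIKIS = {
    "en": ("English", 1000),
    "de": ("Deutsch", 400),
    "ja": ("日本語", 300),
    "cy": ("Cymraeg", 10),
    "es": ("Español", 200),
}


def chooser_entries(wikis):
    """Return (code, native name) pairs, largest edition first."""
    ordered = sorted(wikis.items(), key=lambda item: item[1][1], reverse=True)
    return [(code, name) for code, (name, _count) in ordered]


for code, name in chooser_entries(WIKIS):
    print(f"{code}: {name}")
```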

I’m sidestepping another difficult question: which languages to include in the first place. A recent thread on the Wikipedia mailing list and Ultimate Wiktionary guy Gerard Meijssen both attest to the difficulty of deciding on a language coding standard.

Unicode Notes

Written by Patrick Hall, 2 years ago.

I just ran across an interesting set of articles: the Unicode Technical Notes. The topics are surprisingly broad, and a few are likely to be of interest even to not-so-gearheady language geeks:

Languages of the Blogosphere

Written by Patrick Hall, 2 years ago.

Dave Sifry’s recent state of the blogosphere post has gotten a fair amount of attention: his data suggest that English has lost the title of “most common language of blogs” to Japanese. He points out some important caveats on this claim, however; you can read more about those in Dave’s post, and there are some further interesting comments in Ethan Zuckerman’s response.

Those caveats are enough to keep me from taking the specific ranking of languages too seriously in detail. (But then, it’s not as if any particular ranking of the popularity of languages in the world can be taken too seriously, either.)

More importantly, I don’t think the specifics of the ranking really matter that much. If we put aside the “English apocalypse” banter that many responses have focused on, we can see the more important message here:

The blog world is a very multilingual place, and there isn’t any language in which a majority of blogs are written.

As for the specific contention that Japanese is a heavy hitter in the blog world… well… that’s not too surprising, it’s Japan! And I don’t see anything too shocking in the rest of the list. Actually, it’s quite similar in broad outline to the list of most active Wikipedias. (Although I share this blogger’s surprise [ES] that the rate of growth in the number of Spanish blogs seems to have slowed relative to that of other languages.)

However, I do have another problem with this ranking: we only get to hear about the top of the list. I’d like to see the big picture — the top 100 or so.

Are they still using the default set of 70 or so languages that Maciej Ceglowski built into his initial release of Languid, or have the Technorati folks added more? (Languid, like its predecessor TextCat, can only identify languages on which it has been trained, of course.)
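The "only languages it has been trained on" limitation is worth seeing in miniature. What follows is a toy character-trigram guesser written for this post. It is not Languid's or TextCat's code, the overlap score is a deliberate simplification, and the class name and training sentences are my own assumptions.

```python
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count overlapping three-character sequences in space-padded, lowercased text."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


class ToyIdentifier:
    """Toy guesser: picks whichever trained profile shares the most trigrams."""

    def __init__(self):
        self.profiles = {}  # language name -> Counter of trigrams seen in training

    def train(self, language: str, sample_text: str) -> None:
        self.profiles.setdefault(language, Counter()).update(trigrams(sample_text))

    def guess(self, text: str) -> str:
        if not self.profiles:
            raise ValueError("no languages have been trained")
        grams = trigrams(text)

        def overlap(profile: Counter) -> int:
            # Number of trigram occurrences in the text that the profile has seen.
            return sum(count for gram, count in grams.items() if gram in profile)

        return max(self.profiles, key=lambda lang: overlap(self.profiles[lang]))


identifier = ToyIdentifier()
identifier.train("english", "the quick brown fox jumps over the lazy dog and the cat")
identifier.train("spanish", "el rápido zorro marrón salta sobre el perro perezoso y el gato")

print(identifier.guess("the dog and the fox"))
print(identifier.guess("el perro y el gato"))
# Welsh was never trained, so the answer is forced to be one of the known languages.
print(identifier.guess("mae hen wlad fy nhadau"))
```

Whatever the last call prints, it cannot be "welsh". An identifier like this will happily file every blog in an untrained language under its nearest trained neighbour, which is exactly why the size of the trained list matters for the long tail.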

For me the most interesting linguistic data with regard to the blogosphere isn’t in the top ten, it’s in the nascent blogging communities that are just now popping up. I watched with amazement as the Welsh blogosphere grew from just one guy into a sprawling community. There seemed to be a “critical mass” sort of phenomenon that took place there: suddenly there were too many Welsh blogs to keep in your aggregator. (Even assuming that you could read more than, oh, a paragraph a day. That’s about my rate with Welsh. ☺)

How about it, Dave, any more data to share?