Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Wikipedia’s Interwiki Links

Written by Patrick Hall, 2 years, 4 months ago.

[Image: interwiki links on Wikipedia]

Wikipedia is the biggest multilingual project, ever. “Interwiki links,” links between articles in all the Wikipedias, constitute an impressive translation database. The question is, could it be harvested somehow?

What kind of information is in these links? Well, translations: the Tasmanian Wilderness is called タスマニア原生地域 in Japanese… the Snowdon lily is called Späte Faltenlilie in German… a Moustached Warbler is called a Мустакато шаварче in Bulgarian… and translations for secondary sex characteristics are available in Español (Spanish), Lietuvių (Lithuanian), Svenska (Swedish), and 中文 (Chinese). And on and on.

I’m only just starting to look into this stuff, but the thought of somehow programmatically collecting these correspondences as a rough multilingual lexicon is pretty interesting. Which links are made, and between which languages, is still a pretty random affair, but there are a lot of Wikipedians (myself included) working on fleshing them out. It seems that in the future this will be a sizeable resource indeed. And though there is no metadata in these interwiki links, I suspect that as a whole they will turn out to be more robust than the Wiktionary.
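As a rough illustration of that idea, here is a minimal Python sketch that pulls interwiki links out of an article's raw wikitext, where they are written as [[lang:Title]]. The function name extract_interwiki and the language-code pattern (two or three lowercase letters, optionally followed by hyphenated suffixes) are assumptions made for this example; a real harvester would check each prefix against the actual list of Wikipedia language editions.

```python
import re

# Assumed shape of a language prefix: "de", "ast", "zh-min-nan", ...
# This is a guess for illustration, not an official definition.
INTERWIKI = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z-]+)?):([^\]|]+)\]\]")


def extract_interwiki(wikitext):
    """Return a {language_code: article_title} dict for one article."""
    links = {}
    for lang, title in INTERWIKI.findall(wikitext):
        links[lang] = title.strip()
    return links


sample = (
    "Fungi are eukaryotic organisms. [[Category:Fungi]] "
    "See also [[:fr:Fungi]] for an inline link.\n"
    "[[de:Pilze]]\n[[ja:菌類]]\n[[pl:Grzyby]]\n"
)
print(extract_interwiki(sample))
```

Links with a leading colon, such as [[:fr:Fungi]], are ordinary inline links to another language edition rather than interwiki links, so the pattern deliberately skips them.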

Harvesting and analysis efforts are already being made with Interwiki bots. This graph, for instance, was generated by an automated tool (which apparently lacked a Japanese font!).

A graph of links between Wikipedia articles in several languages

A final prerequisite, and a fundamental one: in order to re-use any content from Wikipedia, one would have to spend some time thinking about where interwiki links themselves fall under Wikipedia’s license, the GNU Free Documentation License. But it’s not clear to me how this applies to the interwiki links alone. The Wikipedia:Forking FAQ says:

As set forth at Wikipedia:Copyrights#Definitions_and_trademarks, Wikipedia considers each Wikipedia article to be an individual document. Moreover, for the purposes of creating derivative works of individual Wikipedia articles, Wikipedia considers a direct link-back to a particular Wikipedia article as being in full compliance with the GNU Free Documentation License (GFDL), provided your derivative work is also licensed under the auspices of the GFDL. As such, would-be Wikipedia forkers need not worry about the challenges involved in setting up a large-scale Web site.

But a link isn’t a document. And should one cite each link with another link? And how could we help to give back to the interwiki linking effort?

Lots of stuff to think about, here. The legalities can get a bit tedious. But they’re certainly important!

Update: Here’s a little comparison of the number of translations in interwiki links with the number of translations on Wiktionary, in this case for the article Fungus:

  1. فطر
  2. Fungi
  3. Гъби
  4. ব্যাঙের ছাতা
  5. Fong
  6. Houby
  7. Ffwng
  8. Svampe
  9. Pilze
  10. Fungo
  11. Fungi
  12. Seened
  13. Sienet
  14. Mycota
  15. Fungas
  16. פטריות
  17. Gomba
  18. Fungi
  19. 菌類
  20. 균류
  21. Fungi
  22. Pilzeräich
  23. Grybų karalystė
  24. Габа
  25. Poggenstöhl
  26. Schimmels
  27. Sopper
  28. Grzyby
  29. Fungos
  30. Грибы
  31. Svampar
  32. பூஞ்சைகள்
  33. เห็ดรา
  34. Mantar
  35. Tchampion
  36. 真菌

Wiktionary has seven entries.

6 Comments for 'Wikipedia’s Interwiki Links'

  1. Comment received 2 years, 4 months ago from chris

    The example of your second paragraph illustrates one of the major uses Wikipedia has for me: a multilingual dictionary of biological terms. I might know the name of a bird or a plant in German. So I go find the Wikipedia page in German, click on the interwiki link, and the page in English or French (or whatever) will not only tell me the official name, but also things like colloquial terms etc. It sure has improved my vocabulary.

  2. Comment received 2 years, 4 months ago from Patrick Hall

    Yeah, interestingly I seem to have come across a fair number of animal names among those random links. (By the way, I really did use the random page link to find those! I didn’t pick “secondary sexual characteristics” in a lascivious attempt at getting some Google juice. ☺)

    In the case of bird names, there are actually other sites, like Avibase, which may be more complete but which are under unclear licensing.

  3. Pingback received 2 years, 4 months ago from links for 2006-02-16 | Edward O’Connor

    […] Wikipedia’s Interwiki Links Harvesting a multilingual lexicon out of Wikipedia’s interwiki links. Clever. (tags: linguistics translation) […]

  4. Comment received 2 years, 4 months ago from Denis Jacquerye

    I don’t think you have to worry about licensing for most uses of the interwiki links. These are very much like regular hyperlinks.

    Much of the interwiki work is checked through the Interwiki-Link-Checker. Unfortunately this only works with homonymous articles on different Wikis. So if es:Fungi, it:Fungi, ast:Fungi and la:Fungi aren’t linked yet, it will let users check they deal with the same topic.

    The rest of work is mostly done by bots propagating interwiki links that have been added to some articles but not to all translations. This leads to unsupervised errors, like the one shown in the diagram.

    Interwiki links have to be bijective by design. This is rather bad because some articles are split in some wikis but not in others. Some people are thinking about a centralised interwiki system, where contributors could edit links once for all the projects; much like Commons is for images and other media. This should be implementable once the whole Wikidata thing is set up. The WiktionaryZ project should help out too.
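The one-to-one constraint described in the comment above can be checked mechanically. The following is only a sketch under assumed data structures: it supposes the harvested links have been stored as a dict keyed by (language, title), and the function name missing_backlinks is made up for this example. It reports every link whose target does not link back, which is the kind of gap the propagating bots try to fill.

```python
def missing_backlinks(links):
    """Find interwiki links that are not reciprocated.

    `links` maps (lang, title) -> {target_lang: target_title}.
    Returns sorted (target, source) pairs where `target` has no
    link back to `source`.
    """
    missing = []
    for (lang, title), targets in links.items():
        for t_lang, t_title in targets.items():
            # An article we have not harvested counts as having no links.
            back = links.get((t_lang, t_title), {})
            if back.get(lang) != title:
                missing.append(((t_lang, t_title), (lang, title)))
    return sorted(missing)


links = {
    ("en", "Fungus"): {"de": "Pilze", "pl": "Grzyby"},
    ("de", "Pilze"): {"en": "Fungus"},
    ("pl", "Grzyby"): {},  # has not been linked back to English yet
}
print(missing_backlinks(links))
```

A missing backlink is not necessarily an error: where one wiki splits a topic across several articles, a strict one-to-one pairing may simply not exist.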

  5. Comment received 2 years, 4 months ago from Patrick Hall

    Hey Denis,

    Thanks for the link to the link checker, very interesting tool, and I spent some quality time with it checking Portuguese/Spanish links. (What happens after they’re checked, by the way? The links didn’t seem to be automatically added…)

    As for the issue of many-to-many versus one-to-one interwiki links, yeah, that’s a very good point. It’s like a bilingual lexicon where every term could only be translated by another, single term.

    Some terms, though, like “fungus” or “mycology” (for some reason I have been on a fungus vocabulary kick of late… portabellos, anyone?) are pretty much going to have only one translation. The more scientific or “personal” the term is, the more likely that is to be the case. A really common word, say, “house,” is liable to have lots of translations and ambiguity.

    Or at least, that’s my intuition — there must be research about the proportions of such words in the lexicon somewhere.

    It’s debatable just how reliable and accurate the terms in these links are, but my cursory investigation would suggest that they’re pretty reliable, and it’s certainly beyond doubt that there’s an awful lot of info in there. At least as a bootstrapping tool, they seem awful tempting.

    (I’m writing a little SAX parser thingie to try to harvest the terms. Good grief, the variety of Python XML libraries is bewildering.)
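For readers wondering what a SAX-based harvester along these lines might look like, here is a minimal sketch using only Python's standard xml.sax module. It is not the parser mentioned in the comment above. It assumes input shaped like a MediaWiki XML export, with page elements containing title and text elements, and the names InterwikiHandler and harvest, as well as the language-code pattern, are invented for this example.

```python
import re
import xml.sax

# Assumed shape of an interwiki link in wikitext: [[lang:Title]]
INTERWIKI = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z-]+)?):([^\]|]+)\]\]")


class InterwikiHandler(xml.sax.ContentHandler):
    """Collect {title: {lang: translated_title}} while streaming pages."""

    def __init__(self):
        super().__init__()
        self.lexicon = {}
        self._tag = None      # element whose character data we are collecting
        self._title = []
        self._text = []

    def startElement(self, name, attrs):
        if name == "page":
            self._title, self._text = [], []
        if name in ("title", "text"):
            self._tag = name

    def characters(self, content):
        # SAX may deliver one element's text in several chunks.
        if self._tag == "title":
            self._title.append(content)
        elif self._tag == "text":
            self._text.append(content)

    def endElement(self, name):
        if name in ("title", "text"):
            self._tag = None
        elif name == "page":
            links = dict(INTERWIKI.findall("".join(self._text)))
            if links:
                self.lexicon["".join(self._title).strip()] = links


def harvest(xml_bytes):
    """Parse export-style XML (as bytes) and return the harvested lexicon."""
    handler = InterwikiHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.lexicon


sample = (
    "<mediawiki><page><title>Fungus</title><revision>"
    "<text>A fungus is ... [[de:Pilze]] [[pl:Grzyby]]</text>"
    "</revision></page></mediawiki>"
).encode("utf-8")
print(harvest(sample))
```

Because SAX streams the document, the same handler could in principle be pointed at a large dump file with xml.sax.parse without loading the whole file into memory.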

  6. Comment received 5 months, 1 week ago from Prashanth Ellina

    Hey, I’ve been working on Wikipedia graph based data analysis. You may be interested in reading more about it here: http://blog.prashanthellina.com/2007/12/21/topic-extraction-using-wikipedia-data/
