Thoughts on an Open Multilingual Dictionary
In my opinion, these are the criteria for an open multilingual dictionary:
- A free license
- Nothing except pairs of words
- A very simple API
There are many, many dictionaries out there on the web. Some of them are open. Some are under an unclear license. A few have APIs. There are lots of formats, from the rather fancy TMX to oldskool CSV. Wiktionary is cool, but its content lives in a well-nigh impenetrable wiki markup which, frankly, does not lend itself to machine processing. I hasten to point out that I have absolutely nothing against the Wiktionary, and that I recognize that there are many generations of descendants of Wiktionary, about which I’m in no way informed enough to be qualified to opine.
All I can say is that, in my own opinion, what is needed is a simple convention for exchanging pairs of corresponding words in two languages. What is not needed is most of the stuff that a professionally written dictionary contains: pronunciation, etymology, example sentences, carefully weeded sense disambiguation.
These are all excellent goals–hey, I’m a language nerd, I love plowing through etymological dictionaries as much as anyone. But if the goal at hand is to help translators (and that of course is the goal in my own endeavors), the task is best defined as getting all the possibilities in front of the translator. Translators know what they’re doing. They need programmers to help them with information retrieval, not word sense disambiguation.
They can figure out, to take an extreme scenario, which of ten translations of a word is the appropriate one in a given context. And if they’re unsure, they’re good at searching to narrow down those possibilities. What is helpful to translators is, to be honest, saving them the effort of looking up a word in many resources, so that they can then apply their refined powers of judgment to picking the correct translation out of a handful of (or even many) possibilities.
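To make "getting all the possibilities in front of the translator" a little more concrete, here is a minimal sketch in Python. It assumes each resource can be reduced to a plain list of word pairs; the two sample sources, their names, and the function `lookup_all` are invented for illustration and do not describe any existing service.

```python
# Hypothetical resources: each one is just a list of (source word, target word) pairs.
SOURCE_A = [("bank", "banco"), ("bank", "orilla")]
SOURCE_B = [("bank", "orilla"), ("bank", "ribera"), ("shore", "orilla")]


def lookup_all(word, sources):
    """Collect every distinct translation of `word` across all sources.

    Returns a dict mapping each translation to the names of the sources
    that list it, in first-seen order. No ranking or sense disambiguation
    is attempted: that judgment is left to the translator.
    """
    found = {}
    for name, pairs in sources.items():
        for src, tgt in pairs:
            if src == word:
                found.setdefault(tgt, []).append(name)
    return found


candidates = lookup_all("bank", {"A": SOURCE_A, "B": SOURCE_B})
for translation, origins in candidates.items():
    print(translation, "(" + ", ".join(origins) + ")")
```

The only design choice here is that nothing is filtered out: every candidate from every resource is shown once, along with where it came from.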
Of course, this opinion assumes massive amounts of handwaving magic about just how to use such an API, how to distribute, how to vet, how to do version control, how to do a million other things. But I thought I would put these thoughts out there as a minimalist point of view.

not sure about the API requisite, but woxikon.com seems pretty good for me. pairs of words in en, pt, es, de, fr and it, and so far behaving very well. recommended :)
I agree mostly, except with the “pairs of words” thing. What I really want is pairs of a word and a language-neutral meaning. From that, you can make lists of word pairs if you like.
The advantage is: a) you have disambiguation if you want it, but, much more importantly, b) it takes the amount of input (work!) required from quadratic in the number of languages (a separate word list for every pair of languages) to linear (each word is linked to its concepts once, in its own language). That’s a *huge* difference, and I think it would especially help “small” languages.
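A rough sketch of how word pairs could be derived from word-and-meaning pairs, in Python. The concept IDs (`C1`, `C2`), the sample links, and both function names are assumptions made up for this example; how to define real language-neutral concepts is a separate question.

```python
from itertools import product

# Hypothetical (language, word, concept ID) links; the concept IDs are invented.
LINKS = [
    ("en", "bank", "C1"),    # financial institution
    ("en", "bank", "C2"),    # edge of a river
    ("es", "banco", "C1"),
    ("es", "orilla", "C2"),
    ("de", "Bank", "C1"),
    ("de", "Ufer", "C2"),
]


def word_pairs(links, src_lang, tgt_lang):
    """Derive (source word, target word) pairs by joining links on concept ID."""
    by_concept = {}
    for lang, word, concept in links:
        by_concept.setdefault(concept, {}).setdefault(lang, []).append(word)
    pairs = set()
    for words in by_concept.values():
        for s, t in product(words.get(src_lang, []), words.get(tgt_lang, [])):
            pairs.add((s, t))
    return sorted(pairs)


def pair_lists_needed(n_languages):
    """Number of unordered language pairs, i.e. separate pairwise word lists."""
    return n_languages * (n_languages - 1) // 2


print(word_pairs(LINKS, "en", "es"))
print(word_pairs(LINKS, "es", "de"))
print(pair_lists_needed(6))
```

Under the assumption of one separate list per unordered pair of languages, six languages need fifteen pairwise lists, whereas the concept links are entered once per language and any pair list can be generated from them.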
I think I understand exactly what you’re saying. My thoughts have been going in the same direction lately. As you say, there are lots of dictionaries out there, and they all like to keep their data in their own complicated and incompatible structures, sense-disambiguated and all. Data exchange is a problem.
There are a couple of data-exchange formats out there, including TBX, OLIF, and the ISO standards TMF and LMF. But their problem is that they’re complicated and take-up hasn’t been great.
What’s needed is an embarrassingly simple standard, something along the lines of the pairs-of-words thing you are suggesting. True, it wouldn’t do justice to the beautiful complexity of language, but it would greatly facilitate data exchange. An on-line dictionary could keep its data in its own complicated structure for its own users, but also make it available for data exchange in this simplified format.
I am currently working on a redesign of my Irish-English dictionary, Pota Focal. It will store its data in a fairly complex XML structure, which I designed myself and which is not compatible with anything else. But, I’d be quite happy to offer an API that feeds a grossly simplified version, let’s say as a machine-readable web service or an RSS feed.
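As a sketch of what "feeding a grossly simplified version" might look like, here is a short Python example that flattens a rich XML entry into bare word pairs. The XML layout below is entirely invented for illustration (element names such as `entry`, `headword`, `sense`, `translation`, and `note` are assumptions); it is not the structure any real dictionary uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical rich structure; every element name here is made up for the example.
RICH_XML = """
<dictionary source="ga" target="en">
  <entry>
    <headword>focal</headword>
    <sense>
      <translation>word</translation>
      <note>extra detail that the simplified view drops</note>
    </sense>
  </entry>
  <entry>
    <headword>pota</headword>
    <sense>
      <translation>pot</translation>
    </sense>
  </entry>
</dictionary>
"""


def simplify(xml_text):
    """Reduce rich entries to (headword, translation) pairs, ignoring everything else."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for entry in root.iter("entry"):
        headword = entry.findtext("headword", default="").strip()
        for translation in entry.iter("translation"):
            pairs.append((headword, (translation.text or "").strip()))
    return pairs


print(simplify(RICH_XML))
```

The rich structure stays the internal source of truth; the simplified pairs are a lossy view generated from it on request.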
It would be great if a standardised file format existed. Now, what should it be called? The Really Simple Dictionary Format? RSDF? Does that sound sexy enough?
I have a vision of a community of on-line dictionaries, each showing its own data to its own users in its own format, but also consuming feeds from other dictionaries and showing these alongside.
P.S. (to Brightbyte): Language-neutral meaning is surprisingly elusive. It’s devilishly difficult to pinpoint even for a pair of languages, never mind for several. I’d not go that way.
@Mírian: I was a bit confused by your comment, until I realized that surely you meant woxikon.com! I took the liberty of changing your comment, if only because woxicon.com is a spammy domain squatter.
Anyway, thanks for the very interesting link; it’s an impressive resource. Autocomplete in multiple languages, verb forms galore, synonyms, etc… It’s a very nice site. I don’t see any evidence of an API or free licensing, however.
For the past 3 years, I have been part of a team of researchers that observes and interviews translators while they are doing their normal day-to-day work, with the aim of better understanding their work practices and how technology could best support them.
I agree with your assessment. In my observation, translators are mostly interested in seeing a list of possible translations for a particular term or expression. More detailed information like pronunciation, etymology, and carefully weeded sense disambiguation is often not looked at by the translator. In the end, most translators seem to rely on their own judgement and domain knowledge to choose which, among the possible suggestions, is most appropriate for their current needs. I disagree, however, regarding example sentences. In my observation, that’s the one piece of additional information that translators do look at, in order to figure out in what contexts a particular translation can be used.
Another argument for the simple word pair approach is that it’s fast for the translators to create them. Our observation of translators at work shows that they seldom take the time to create entries in terminology databases. When they do, it’s often in the form of word or term pairs, with no additional information. Translators are usually paid by the number of words translated, and often work under tight deadlines. So they don’t have a tendency to spend too much time on activities that do not immediately pay off (for example, writing a terminological entry for a term that you probably won’t need for a month or so). These observations are also in line with what Lynn Bowker at OttawaU observed when she looked at the terminology management practices of translators.
Note that this does not mean that additional information is not a good idea. It just means that:
* You won’t get that information from translators
* You need to allow translators to view a simplified entry that does not contain extraneous information like pronunciation and etymology (otherwise, the entry is not as compact and easy to consult).
Alain Désilets
Here’s a convention: Text file, one word pair per line, words separated by a tab.
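A minimal sketch of reading and writing that convention in Python. The function names and the sample data are made up for illustration; the only rule taken from the convention is one pair per line, with the two words separated by a single tab.

```python
def parse_pairs(text):
    """Parse 'word<TAB>word' lines into a list of (source, target) tuples."""
    pairs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(fields)}"
            )
        pairs.append((fields[0].strip(), fields[1].strip()))
    return pairs


def format_pairs(pairs):
    """Write pairs back out, one per line, separated by a tab."""
    return "".join(f"{src}\t{tgt}\n" for src, tgt in pairs)


# Made-up sample data; multi-word terms work because only the tab separates fields.
SAMPLE = "dog\tperro\ncat\tgato\nto give up\trendirse\n"

pairs = parse_pairs(SAMPLE)
print(pairs)
print(format_pairs(pairs) == SAMPLE)
```

Rejecting any line that does not have exactly two fields is a deliberate choice in this sketch: a stray or missing tab is reported with its line number instead of being silently misread.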
PS: Patrick, a word seems to be missing in the post. They can apply their refined powers of…?
@Ke: That’s also a good convention, I think, but the problem with tabs is that they aren’t visible like newlines are. If you have a line like:
It’s all too easy to lose track of where the tabs are in that second line. As a case in point, I can’t put tabs into this text area! That’s why I prefer:
I’ve got a post to this effect here: Hacklog: Blogamundo » Blog Archive » Dictionaries for Minimalists
& thanks for pointing out the typo, fixed.
@Brightbyte, MBM, Alain: Thanks for your comments! I hope we can keep this conversation going. I’m working on another post which will reply to your observations.
Incidentally, there are at least a handful of other folks who are also interested in this topic; maybe we can all get our heads together and try to make something happen.
I very strongly believe that a concept-centered approach is the best option for the internal representation of dictionaries, at least if you deal with more than two languages. This is especially true if the vocabulary is to be used by machines, for example for indexing and searching.
However, for an exchange format, API and translator’s UI, a simple pair of words approach may be best (even for adding to the database). I see roughly four levels of complexity: 1) pairs of words 2) pairs of word and concept (or, derived from that, meanings for a term, and terms for those meanings) 3) a thesaurus-like structure as defined by RDF/SKOS and 4) the full featured native structure of whatever you have, represented as XML or similar (maybe also RDF-based).
So… make simple interfaces. Don’t let that keep you from building complex structures for representing all the information you have, want or need. I think OmegaWiki is a good and a bad example for this. It has a very powerful and flexible structure, but it exposes way too much of it. This, I’m told, is about to change.